<font size="+3"><strong>6.1. Exploring the Data</strong></font>
In this project, we're going to work with data from the [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) (SCF). The SCF is a survey sponsored by the US Federal Reserve. It tracks financial, demographic, and opinion information about families in the United States. The survey is conducted every three years, and we'll work with an extract of the results from 2019.
```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import wqet_grader
from IPython.display import VimeoVideo

wqet_grader.init("Project 6 Assessment")
```

```python
VimeoVideo("710780578", h="43bb879d16", width=600)
```

# Prepare Data
## Import
First, we need to load the data, which is stored in a compressed CSV file: `SCFP2019.csv.gz`. In the last project, you learned how to decompress files using `gzip` and the command line. However, the pandas `read_csv` function can work with compressed files directly.
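As a minimal, self-contained sketch of this behavior (using a tiny synthetic CSV written to a temporary file, rather than the real SCF extract), pandas infers gzip compression from the `.gz` extension:

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed CSV to a temporary file
# (a made-up stand-in for data/SCFP2019.csv.gz)
path = os.path.join(tempfile.gettempdir(), "toy_scf.csv.gz")
with gzip.open(path, "wt") as f:
    f.write("TURNFEAR,AGE\n1,34\n0,52\n1,41\n")

# read_csv detects the compression from the file extension
df = pd.read_csv(path)
print("df shape:", df.shape)
```

No `compression=` argument is needed here, though `pd.read_csv(path, compression="gzip")` would also work.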
```python
VimeoVideo("710781788", h="efd2dda882", width=600)
```

**Task 6.1.1:** Read the file `"data/SCFP2019.csv.gz"` into the DataFrame `df`.
```python
df = ...
print("df shape:", df.shape)
df.head()
```

One of the first things you might notice here is that this dataset is HUGE — over 20,000 rows and 351 columns! SO MUCH DATA!!! We won't have time to explore all of the features in this dataset, but you can look in the [data dictionary](./066-data-dictionary.ipynb) for this project for details and links to the official [Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbk.htm). For now, let's just say that this dataset tracks all sorts of behaviors relating to the ways households earn, save, and spend money in the United States.
For this project, we're going to focus on households that have "been turned down for credit or feared being denied credit in the past 5 years." These households are identified in the "TURNFEAR" column.
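The subsetting step can be sketched with a toy DataFrame (the column values below are made up, not real survey data):

```python
import pandas as pd

# Toy stand-in for the SCF data: a TURNFEAR flag plus one other column
df = pd.DataFrame({"TURNFEAR": [1, 0, 1, 0], "AGE": [34, 52, 41, 67]})

# Boolean mask: True for credit-fearful households
mask = df["TURNFEAR"] == 1
df_fear = df[mask]
print("df_fear shape:", df_fear.shape)
```

The mask is a Series of booleans aligned to the DataFrame's index; indexing with it keeps only the rows where the condition holds.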
```python
VimeoVideo("710783015", h="c24ce96aab", width=600)
```

**Task 6.1.2:** Use a boolean mask to subset `df` to only households that have been turned down or feared being turned down for credit (`"TURNFEAR" == 1`). Assign this subset to the variable name `df_fear`.
```python
mask = ...
df_fear = ...
print("df_fear shape:", df_fear.shape)
df_fear.head()
```

## Explore
### Age
Now that we have our subset, let's explore the characteristics of this group. One of the features is age group (`"AGECL"`).
```python
VimeoVideo("710784794", h="71b10e363d", width=600)
```

**Task 6.1.3:** Create a list `age_groups` with the unique values in the `"AGECL"` column. Then review the entry for `"AGECL"` in the [Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbkfx0.htm) to determine what the values represent.
```python
age_groups = ...
print("Age Groups:", age_groups)
```

Looking at the Code Book we can see that `"AGECL"` represents categorical data, even though the values in the column are numeric.

This simplifies data storage, but it's not very human-readable. So before we create a visualization, let's create a version of this column that uses the actual group names.
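One possible approach to that mapping — a sketch on a toy Series, using `Series.replace` with a dictionary (the lesson video may demonstrate a different method):

```python
import pandas as pd

# Group names from the SCF Code Book, keyed by the integer codes
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}

# Toy stand-in for df_fear["AGECL"] (codes below are made up)
agecl_codes = pd.Series([1, 3, 1, 6, 2])
age_cl = agecl_codes.replace(agecl_dict)
print(age_cl.tolist())
```

Each integer code is swapped for its human-readable label; values not in the dictionary would pass through unchanged.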
```python
VimeoVideo("710785566", h="f0fafd3a29", width=600)
```

**Task 6.1.4:** Create a Series `age_cl` that contains the observations from `"AGECL"` using the true group names.
```python
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}
age_cl = ...
age_cl.head()
```

Now that we have better labels, let's make a bar chart and see the age distribution of our group.
```python
VimeoVideo("710840376", h="d43825c14b", width=600)
```

**Task 6.1.5:** Create a bar chart showing the value counts from `age_cl`. Be sure to label the x-axis `"Age Group"`, the y-axis `"Frequency (count)"`, and use the title `"Credit Fearful: Age Groups"`.
```python
age_cl_value_counts = ...
# Bar plot of `age_cl_value_counts`
```

You might have noticed that by creating their own age groups, the authors of the survey have essentially made a six-bin histogram for us. Our chart is telling us that many of the people who fear being denied credit are younger. But the first two age groups cover a wider range than the other four. So it might be useful to look inside those values to get a more granular understanding of the data.
To do that, we'll need to look at a different variable: "AGE". Whereas "AGECL" was a categorical variable, "AGE" is continuous, so we can use it to make a histogram of our own.
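A ten-bin histogram of this kind can be sketched with synthetic ages (made up for illustration — the real call would use `df_fear["AGE"]`; the `Agg` backend line just keeps the sketch runnable without a display and isn't needed inside Jupyter):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages standing in for df_fear["AGE"]
rng = np.random.default_rng(0)
ages = rng.normal(loc=40, scale=12, size=500).clip(18, 95)

# Ten bins, labeled like the task asks
counts, bins, _ = plt.hist(ages, bins=10)
plt.xlabel("Age")
plt.ylabel("Frequency (count)")
plt.title("Credit Fearful: Age Distribution")
```

`plt.hist` returns the bin counts and edges, which can be handy for checking how observations fall into each bin.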
```python
VimeoVideo("710841580", h="a146a24e5c", width=600)
```

**Task 6.1.6:** Create a histogram of the `"AGE"` column with 10 bins. Be sure to label the x-axis `"Age"`, the y-axis `"Frequency (count)"`, and use the title `"Credit Fearful: Age Distribution"`.
```python
# Plot histogram of "AGE"
```

It looks like younger people are still more concerned about being able to secure a loan than older people, but the people who are *most* concerned seem to be between 30 and 40.
### Race
Now that we have an understanding of how age relates to our outcome of interest, let's try some other possibilities, starting with race. If we look at the [Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbk0001.htm#RACE) for `"RACE"`, we can see that there are 4 categories.

Note that there's no category 4 here. If one did exist, it would be reasonable to assign it to "Asian American / Pacific Islander" — a group that doesn't seem to be represented in the dataset. This is a strange omission, but you'll often find that large public datasets have these sorts of issues. The important thing is to always read the data dictionary carefully. In this case, remember that this dataset doesn't provide a complete picture of race in America — something that you'd have to explain to anyone interested in your analysis.
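Normalized value counts plus a horizontal bar chart can be sketched like so (toy codes below; the real input would be `df_fear["RACE"]`, and the `Agg` line is only there so the sketch runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import pandas as pd

race_dict = {
    1: "White/Non-Hispanic",
    2: "Black/African-American",
    3: "Hispanic",
    5: "Other",
}

# Toy stand-in for df_fear["RACE"]
race = pd.Series([1, 1, 2, 3, 1, 2, 5, 1]).replace(race_dict)

# normalize=True converts counts to proportions that sum to 1
race_value_counts = race.value_counts(normalize=True)
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("Credit Fearful: Racial Groups")
```

Because the proportions sum to 1, fixing `xlim` to `(0, 1)` makes charts built from different subsets directly comparable.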
```python
VimeoVideo("710842177", h="8d8354e091", width=600)
```

**Task 6.1.7:** Create a horizontal bar chart showing the normalized value counts for `"RACE"`. In your chart, you should replace the numerical values with the true group names. Be sure to label the x-axis `"Frequency (%)"`, the y-axis `"Race"`, and use the title `"Credit Fearful: Racial Groups"`. Finally, set the `xlim` for this plot to `(0,1)`.
```python
race_dict = {
    1: "White/Non-Hispanic",
    2: "Black/African-American",
    3: "Hispanic",
    5: "Other",
}
race = ...
race_value_counts = ...
# Create bar chart of race_value_counts
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("Credit Fearful: Racial Groups");
```

This suggests that White/Non-Hispanic people worry more about being denied credit, but thinking critically about what we're seeing, that might be because there are more White/Non-Hispanic people in the population of the United States than there are other racial groups, and the sample for this survey was specifically drawn to be representative of the population as a whole.
```python
VimeoVideo("710844376", h="8e1fdf92ef", width=600)
```

**Task 6.1.8:** Recreate the horizontal bar chart you just made, but this time use the entire dataset `df` instead of the subset `df_fear`. The title of this plot should be `"SCF Respondents: Racial Groups"`.
```python
race = ...
race_value_counts = ...
# Create bar chart of race_value_counts
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("SCF Respondents: Racial Groups");
```

How does this second bar chart change our perception of the first one? On the one hand, we can see that White Non-Hispanics account for around 70% of the whole dataset, but only 54% of credit fearful respondents. On the other hand, Black and Hispanic respondents represent 23% of the whole dataset but 40% of credit fearful respondents. In other words, Black and Hispanic households are actually *more* likely to be in the credit fearful group.
### Income
What about income level? Are people with lower incomes concerned about being denied credit, or is that something people with more money worry about? In order to answer that question, we'll need to again compare the entire dataset with our subgroup using the `"INCCAT"` feature, which captures income percentile groups. This time, though, we'll make a single, side-by-side bar chart.
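A frequency table of this shape can be sketched with `groupby` plus normalized `value_counts` on toy data (the rows below are made up; the real input would be the full SCF DataFrame, and the lesson video may build it differently):

```python
import pandas as pd

# Toy stand-in for the SCF columns used here
df = pd.DataFrame({
    "TURNFEAR": [0, 0, 0, 1, 1, 1],
    "INCCAT":   [6, 6, 4, 1, 1, 2],
})
inccat_dict = {
    1: "0-20", 2: "21-39.9", 3: "40-59.9",
    4: "60-79.9", 5: "80-89.9", 6: "90-100",
}

df_inccat = (
    df.assign(INCCAT=df["INCCAT"].replace(inccat_dict))  # codes -> labels
    .groupby("TURNFEAR")["INCCAT"]
    .value_counts(normalize=True)  # per-group proportions
    .rename("frequency")
    .reset_index()
)
print(df_inccat)
```

Within each `TURNFEAR` group the `frequency` column sums to 1, which is what makes the two groups comparable despite their very different sizes.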

```python
VimeoVideo("710849451", h="34a367a3f9", width=600)
```

**Task 6.1.9:** Create a DataFrame `df_inccat` that shows the normalized frequency for income categories for both the credit fearful and non-credit fearful households in the dataset. Your final DataFrame should look something like this:
|    | TURNFEAR | INCCAT  | frequency |
|----|----------|---------|-----------|
| 0  | 0        | 90-100  | 0.297296  |
| 1  | 0        | 60-79.9 | 0.174841  |
| 2  | 0        | 40-59.9 | 0.143146  |
| 3  | 0        | 0-20    | 0.140343  |
| 4  | 0        | 21-39.9 | 0.135933  |
| 5  | 0        | 80-89.9 | 0.108441  |
| 6  | 1        | 0-20    | 0.288125  |
| 7  | 1        | 21-39.9 | 0.256327  |
| 8  | 1        | 40-59.9 | 0.228856  |
| 9  | 1        | 60-79.9 | 0.132598  |
| 10 | 1        | 90-100  | 0.048886  |
| 11 | 1        | 80-89.9 | 0.045209  |
```python
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}
df_inccat = ...
df_inccat
```

```python
VimeoVideo("710852691", h="3dcbf24a68", width=600)
```

**Task 6.1.10:** Using seaborn, create a side-by-side bar chart of `df_inccat`. Set `hue` to `"TURNFEAR"`, and make sure that the income categories are in the correct order along the x-axis. Label the x-axis `"Income Category"`, the y-axis `"Frequency (%)"`, and use the title `"Income Distribution: Credit Fearful vs. Non-fearful"`.
```python
# Create bar chart of `df_inccat`
plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Credit Fearful vs. Non-fearful");
```

Comparing the income categories across the fearful and non-fearful groups, we can see that credit fearful households are much more common in the lower income categories. In other words, the credit fearful have lower incomes.
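For reference, a side-by-side chart like the one described above can be sketched with seaborn's `barplot`, using `hue` for the groups and an explicit `order` for the categories (the frequencies below are made-up placeholders, not the real SCF numbers):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

order = ["0-20", "21-39.9", "40-59.9", "60-79.9", "80-89.9", "90-100"]

# Toy stand-in for df_inccat
df_inccat = pd.DataFrame({
    "TURNFEAR": [0, 0, 1, 1],
    "INCCAT": ["0-20", "90-100", "0-20", "90-100"],
    "frequency": [0.14, 0.30, 0.29, 0.05],
})

# hue draws paired bars per category; order fixes the x-axis sequence
sns.barplot(data=df_inccat, x="INCCAT", y="frequency", hue="TURNFEAR", order=order)
plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Credit Fearful vs. Non-fearful")
```

Without `order`, seaborn would arrange the categories in the order it first encounters them, which usually isn't the natural percentile ordering.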
So, based on all this, what do we know? Among the people who responded that they were worried about being approved for credit after having been denied in the past five years, the young and the low-income made up the largest shares of respondents. That makes sense, right? Young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit makes them more anxious about the future.
### Assets
Not all the data is demographic, though. If you were working for a bank, you would probably care less about how old the people are, and more about their ability to carry more debt. If we were going to build a model for that, we'd want to establish some relationships among the variables, and making some correlation matrices is a good place to start.
First, let's zoom out a little bit. We've been looking at only the people who answered "yes" when the survey asked about `"TURNFEAR"`, but what if we looked at everyone instead? To begin with, let's bring in the complete dataset and run a single correlation.
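Pairwise correlation in pandas is a one-line call; here's a hedged sketch on made-up numbers (the real call would use the SCF `"ASSET"` and `"HOUSES"` columns):

```python
import pandas as pd

# Made-up values standing in for the real "ASSET" and "HOUSES" columns
df = pd.DataFrame({
    "ASSET":  [100, 200, 300, 400],
    "HOUSES": [80, 150, 290, 310],
})

# Pearson correlation coefficient between the two columns
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("Asset/Houses correlation:", asset_house_corr)
```

`Series.corr` defaults to the Pearson coefficient; `method="spearman"` or `method="kendall"` are available if a rank-based measure is more appropriate.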
```python
VimeoVideo("710856200", h="7b06e8b7f2", width=600)
```

**Task 6.1.11:** Calculate the correlation coefficient for `"ASSET"` and `"HOUSES"` in the whole dataset `df`.
```python
asset_house_corr = ...
print("SCF: Asset Houses Correlation:", asset_house_corr)
```

That's a moderate positive correlation, which we would probably expect, right? For many Americans, the value of their primary residence makes up most of the value of their total assets. What about the people in our `"TURNFEAR"` subset, though? Let's run that correlation to see if there's a difference.
```python
VimeoVideo("710857088", h="33b8f810fb", width=600)
```

**Task 6.1.12:** Calculate the correlation coefficient for `"ASSET"` and `"HOUSES"` in the credit-fearful subset `df_fear`.
```python
asset_house_corr = ...
print("Credit Fearful: Asset Houses Correlation:", asset_house_corr)
```

Aha! They're different! It's still only a moderate positive correlation, but the relationship between the total value of assets and the value of the primary residence is stronger for our `"TURNFEAR"` group than it is for the population as a whole.
Let's make correlation matrices using the rest of the data for both `df` and `df_fear` and see if the differences persist. Here, we'll look at only 5 features: `"ASSET"`, `"HOUSES"`, `"INCOME"`, `"DEBT"`, and `"EDUC"`.
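A correlation matrix over a selected list of columns can be sketched like this (toy columns with made-up values; the real call would select the five SCF features named above):

```python
import pandas as pd

# Toy stand-ins for a few SCF features
df = pd.DataFrame({
    "ASSET":  [100, 200, 300, 400, 500],
    "HOUSES": [60, 180, 250, 330, 480],
    "DEBT":   [50, 40, 90, 70, 120],
})

# Select the columns of interest, then compute pairwise correlations
cols = ["ASSET", "HOUSES", "DEBT"]
corr = df[cols].corr()
print(corr)
```

The result is a square, symmetric DataFrame with 1.0 on the diagonal (every column correlates perfectly with itself), which is why `.style.background_gradient` is a convenient way to spot the off-diagonal structure at a glance.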
```python
VimeoVideo("710857545", h="c67691d13e", width=600)
```

**Task 6.1.13:** Make a correlation matrix using `df`, considering only the columns `"ASSET"`, `"HOUSES"`, `"INCOME"`, `"DEBT"`, and `"EDUC"`.
```python
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = ...
corr.style.background_gradient(axis=None)
```

```python
wqet_grader.grade("Project 6 Assessment", "Task 6.1.13", corr)
```

```python
VimeoVideo("710858210", h="b679fd1fa5", width=600)
```

**Task 6.1.14:** Make a correlation matrix using `df_fear`.
```python
corr = ...
corr.style.background_gradient(axis=None)
```

Whoa! There are some pretty important differences here! The relationship between `"DEBT"` and `"HOUSES"` is positive for both datasets, but while the coefficient for `df` is fairly weak at 0.26, the same number for `df_fear` is 0.96.
Remember, the closer a correlation coefficient is to 1.0, the more closely the two variables move together. In this case, that means the value of the primary residence and the total debt held by the household are getting pretty close to being the same. This suggests that the main source of debt being carried by our `"TURNFEAR"` folks is their primary residence, which, again, is an intuitive finding.
`"DEBT"` and `"ASSET"` show a similarly striking difference, as do `"EDUC"` and `"DEBT"`, which, while not as extreme a contrast as the others, is still big enough to catch the interest of our hypothetical banker.
Let's make some visualizations to show these relationships graphically.
### Education
First, let's start with education levels (`"EDUC"`), comparing credit fearful and non-credit fearful groups.
```python
VimeoVideo("710858769", h="2e6596cd4b", width=600)
```

**Task 6.1.15:** Create a DataFrame `df_educ` that shows the normalized frequency for education categories for both the credit fearful and non-credit fearful households in the dataset. This will be similar in format to `df_inccat`, but focus on education. **Note** that you don't need to replace the numerical values in `"EDUC"` with the true labels.
|     | TURNFEAR | EDUC | frequency |
|-----|----------|------|-----------|
| 0   | 0        | 12   | 0.257481  |
| 1   | 0        | 8    | 0.192029  |
| 2   | 0        | 13   | 0.149823  |
| 3   | 0        | 9    | 0.129833  |
| 4   | 0        | 14   | 0.096117  |
| 5   | 0        | 10   | 0.051150  |
| ... | ...      | ...  | ...       |
| 25  | 1        | 5    | 0.015358  |
| 26  | 1        | 2    | 0.012979  |
| 27  | 1        | 3    | 0.011897  |
| 28  | 1        | 1    | 0.005408  |
| 29  | 1        | -1   | 0.003245  |
```python
df_educ = ...
df_educ.head()
```

```python
VimeoVideo("710861978", h="81349c4b6a", width=600)
```

**Task 6.1.16:** Using seaborn, create a side-by-side bar chart of `df_educ`. Set `hue` to `"TURNFEAR"`, and make sure that the education categories are in the correct order along the x-axis. Label the x-axis `"Education Level"`, the y-axis `"Frequency (%)"`, and use the title `"Educational Attainment: Credit Fearful vs. Non-fearful"`.
```python
# Create bar chart of `df_educ`
plt.xlabel("Education Level")
plt.ylabel("Frequency (%)")
plt.title("Educational Attainment: Credit Fearful vs. Non-fearful");
```

In this plot, we can see that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit fearful.
### Debt
Let's keep going with some scatter plots that look at debt.
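A debt-vs-asset scatter plot can be sketched like this (synthetic points with a loosely linear relationship; the real calls would pass `df["DEBT"]` and `df["ASSET"]`):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, loosely linear stand-ins for "DEBT" and "ASSET"
rng = np.random.default_rng(1)
debt = rng.uniform(0, 5e5, size=200)
asset = debt * 1.5 + rng.normal(0, 5e4, size=200)

# alpha makes overlapping points easier to see
plt.scatter(debt, asset, alpha=0.5)
plt.xlabel("Household Debt")
plt.ylabel("Total Assets")
plt.title("Toy Data: Assets vs. Debt")
```

`sns.scatterplot(x=..., y=...)` would produce an equivalent chart with seaborn styling.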
```python
VimeoVideo("710862939", h="0f6e0fc201", width=600)
```

**Task 6.1.17:** Use `df` to make a scatter plot showing the relationship between `DEBT` and `ASSET`.
```python
# Create scatter plot of ASSET vs DEBT, df
```

```python
VimeoVideo("710864442", h="2428f1c168", width=600)
```

**Task 6.1.18:** Use `df_fear` to make a scatter plot showing the relationship between `DEBT` and `ASSET`.
```python
# Create scatter plot of ASSET vs DEBT, df_fear
```

You can see that the relationship in our `df_fear` graph is flatter than the one in our `df` graph; the two are clearly different.
Let's end with the most striking difference from our matrices, and make some scatter plots showing the difference between `HOUSES` and `DEBT`.
```python
VimeoVideo("710865281", h="2e9fc0d9b9", width=600)
```

**Task 6.1.19:** Use `df` to make a scatter plot showing the relationship between `HOUSES` and `DEBT`.
```python
# Create scatter plot of HOUSES vs DEBT, df
```

And make the same scatter plot using `df_fear`.
```python
VimeoVideo("710870286", h="3cd177965a", width=600)
```

**Task 6.1.20:** Use `df_fear` to make a scatter plot showing the relationship between `HOUSES` and `DEBT`.
```python
# Create scatter plot of HOUSES vs DEBT, df_fear
```

The outliers make it a little difficult to see the difference between these two plots, but the relationship is clear enough: our `df_fear` graph shows an almost perfect linear relationship, while our `df` graph shows something a little more muddled. You might also notice that the datapoints on the `df_fear` graph form several little groups. Those are called "clusters," and we'll be talking more about how to analyze clustered data in the next lesson.
---

Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
<font size="+3"><strong>6.2. Clustering with Two Features</strong></font>
In the previous lesson, you explored data from the [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) (SCF), paying special attention to households that have been turned down for credit or feared being denied credit. In this lesson, we'll build a model to segment those households into distinct clusters, and examine the differences between those clusters.
```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import wqet_grader
from IPython.display import VimeoVideo
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from teaching_tools.widgets import ClusterWidget, SCFClusterWidget

wqet_grader.init("Project 6 Assessment")
```

```python
VimeoVideo("713919442", h="7b4cbc1495", width=600)
```

# Prepare Data
## Import
Just like always, we need to begin by bringing our data into the project. We spent some time in the previous lesson working with a subset of the larger SCF dataset called "TURNFEAR". Let's start with that.
```python
VimeoVideo("713919411", h="fd4fae4013", width=600)
```

**Task 6.2.1:** Create a `wrangle` function that takes the path of a CSV file as input, reads the file into a DataFrame, subsets the data to households that have been turned down for credit or feared being denied credit in the past 5 years (see `"TURNFEAR"`), and returns the subset DataFrame.
```python
def wrangle(filepath):
    # Read the (possibly gzip-compressed) CSV into a DataFrame
    df = pd.read_csv(filepath)
    # Keep only the credit-fearful households
    mask = df["TURNFEAR"] == 1
    df = df[mask]
    return df
```

And now that we've got that taken care of, we'll import the data and see what we've got.
**Task 6.2.2:** Use your `wrangle` function to read the file `SCFP2019.csv.gz` into a DataFrame named `df`.
```python
df = wrangle("data/SCFP2019.csv.gz")
print(df.shape)
df.head()
```

```
(4623, 351)
```
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
| 7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
5 rows × 351 columns
## Explore
We looked at a lot of different features of the `"TURNFEAR"` subset in the last lesson, and the last thing we looked at was the relationship between real estate and debt. To refresh our memory on what that relationship looked like, let's make that graph again.
```python
VimeoVideo("713919351", h="55dc979d55", width=600)
```

**Task 6.2.3:** Create a scatter plot that shows the total value of primary residence of a household (`"HOUSES"`) as a function of the total value of household debt (`"DEBT"`). Be sure to label your x-axis as `"Household Debt"`, your y-axis as `"Home Value"`, and use the title `"Credit Fearful: Home Value vs. Household Debt"`.
```python
# Plot "HOUSES" vs "DEBT"
sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6)
plt.xlabel("Household Debt [$1M]")
plt.ylabel("Home Value [$1M]")
plt.title("Credit Fearful: Home Value vs. Household Debt");
```

Remember that graph and its clusters? Let's get a little deeper into it.
## Split
We need to split our data, but we're not going to need a target vector or a test set this time around. That's because the model we'll be building involves *unsupervised* learning. It's called *unsupervised* because the model doesn't try to map input to a set of labels or targets that already exist. It's kind of like how humans learn new skills, in that we don't always have models to copy. Sometimes, we just try out something and see what happens. Keep in mind that this doesn't make these models any less useful, it just makes them different.
So, keeping that in mind, let's do the split.
```python
VimeoVideo("713919336", h="775867f48a", width=600)
```

**Task 6.2.4:** Create the feature matrix `X`. It should contain two features only: `"DEBT"` and `"HOUSES"`.
```python
X = df[["HOUSES", "DEBT"]]
print(X.shape)
X.head()
```

```
(4623, 2)
```
| | HOUSES | DEBT |
|---|---|---|
| 5 | 0.0 | 12200.0 |
| 6 | 0.0 | 12600.0 |
| 7 | 0.0 | 15300.0 |
| 8 | 0.0 | 14100.0 |
| 9 | 0.0 | 15400.0 |
# Build Model
Before we start building the model, let's take a second to talk about something called `KMeans`.
Take another look at the scatter plot we made at the beginning of this lesson. Remember how the datapoints form little clusters? It turns out we can use an algorithm that partitions the dataset into smaller groups.
Let's take a look at how those things work together.
```python
VimeoVideo("713919214", h="028502efe7", width=600)
```

**Task 6.2.5:** Run the cell below to display the `ClusterWidget`.
```python
cw = ClusterWidget(n_clusters=3)
cw.show()
```
Take a second and run slowly through all the positions on the slider. At the first position, there's a whole bunch of gray datapoints, and if you look carefully, you'll see there are also three stars. Those stars are the **centroids**. At first, their position is set randomly. If you move the slider one more position to the right, you'll see all the gray points change colors that correspond to three clusters.
Since a centroid represents the mean value of all the data in the cluster, we would expect it to fall in the center of whatever cluster it's in. That's what will happen if you move the slider one more position to the right. See how the centroids moved?
Aha! But since they moved, the datapoints might not be in the right clusters anymore. Move the slider again, and you'll see the data points redistribute themselves to better reflect the new position of the centroids. The new clusters mean that the centroids also need to move, which will lead to the clusters changing again, and so on, until all the datapoints end up in the right cluster with a centroid that reflects the mean value of all those points.
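The assign-then-update loop described above can be sketched in a few lines of code. This is just an illustrative sketch on made-up data (NumPy and the `mini_kmeans` name are our additions here, not part of the lesson or of scikit-learn):

```python
import numpy as np

def mini_kmeans(points, k=2, n_iter=10, seed=42):
    """Toy sketch of the k-means loop: assign points, then move centroids."""
    rng = np.random.default_rng(seed)
    # Step 0: place centroids at randomly chosen datapoints
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two obvious groups of made-up points
toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = mini_kmeans(toy, k=2)
print(labels)      # the two tight groups end up in separate clusters
print(centroids)
```

In practice we'll rely on scikit-learn's `KMeans`, which adds smarter initialization and convergence checks, but the core loop is the same.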
Let's see what happens when we try the same with our "DEBT" and "HOUSES" data.
```python
VimeoVideo("713919177", h="102616b1c3", width=600)
```

**Task 6.2.6:** Run the cell below to display the `SCFClusterWidget`.
```python
scfc = SCFClusterWidget(x=df["DEBT"], y=df["HOUSES"], n_clusters=3)
scfc.show()
```
## Iterate
Now that you've had a chance to play around with the process a little bit, let's get into how to build a model that does the same thing.
```python
VimeoVideo("713919157", h="0b2c3c95f2", width=600)
```

**Task 6.2.7:** Build a `KMeans` model, assign it to the variable name `model`, and fit it to the training data `X`.
**Tip:** Be sure to set a `random_state` for all your models in this lesson.
```python
# Build model
model = KMeans(n_clusters=3, random_state=42)

# Fit model to data
model.fit(X)
```

```
KMeans(n_clusters=3, random_state=42)
```
And there it is: 4,623 datapoints spread across three clusters. Let's grab the labels that the model has assigned to the data points so we can start making a new visualization.
```python
VimeoVideo("713919137", h="7eafe805ff", width=600)
```

**Task 6.2.8:** Extract the labels that your `model` created during training and assign them to the variable `labels`.
```python
labels = model.labels_
labels[:10]
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
```
Using the labels we just extracted, let's recreate the scatter plot from before, but this time we'll color each point according to the cluster to which the model assigned it.
```python
VimeoVideo("713919104", h="2f6d4285f1", width=600)
```

**Task 6.2.9:** Recreate the "Home Value vs. Household Debt" scatter plot you made above, but with two changes. First, use seaborn to create the plot. Second, pass your `labels` to the `hue` argument, and set the `palette` argument to `"deep"`.
```python
# Plot "HOUSES" vs "DEBT" with hue=labels
sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6, hue=labels, palette="deep")
plt.xlabel("Household Debt [$1M]")
plt.ylabel("Home Value [$1M]")
plt.title("Credit Fearful: Home Value vs. Household Debt");
```

Nice! Each cluster has its own color. The centroids are still missing, so let's pull those out.
```python
VimeoVideo("713919087", h="9b8635c9a8", width=600)
```

**Task 6.2.10:** Extract the centroids that your `model` created during training, and assign them to the variable `centroids`.
```python
centroids = model.cluster_centers_
centroids
```

```
array([[  116150.29328698,    91017.57766674],
       [34484000.        , 18384100.        ],
       [11666666.66666667,  5065800.        ]])
```

Let's add the centroids to the graph.
```python
VimeoVideo("713919002", h="08cba14f6b", width=600)
```

**Task 6.2.11:** Recreate the seaborn "Home Value vs. Household Debt" scatter plot you just made, but with one difference: Add the `centroids` to the plot. Be sure to set the centroids color to `"gray"`.
```python
# Plot "HOUSES" vs "DEBT", add centroids
sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6, hue=labels, palette="deep")
plt.scatter(
    x=centroids[:, 1] / 1e6,
    y=centroids[:, 0] / 1e6,
    color="gray",
    marker="*",
    s=150,
)
plt.xlabel("Household Debt [$1M]")
plt.ylabel("Home Value [$1M]")
plt.title("Credit Fearful: Home Value vs. Household Debt");
```

That looks great, but let's not pat ourselves on the back just yet. Even though our graph makes it *look* like the clusters are correctly assigned, as data scientists, we need a numerical evaluation. The data we're using is pretty clear-cut, but if things were a little more muddled, we'd want to run some calculations to make sure we got everything right.
There are two metrics that we'll use to evaluate our clusters. We'll start with **inertia**, which measures the distance between the points within the same cluster.
```python
VimeoVideo("713918749", h="bfc741b1e7", width=600)
```

**Question:** What do those double bars in the equation mean?

**Answer:** It's the L2 norm, that is, the non-negative Euclidean distance between each datapoint and its centroid. In Python, it would be something like `sqrt((x1-c1)**2 + (x2-c2)**2 + ...)`.

Many thanks to Aghogho Esuoma Monorien for his comment in the forum! 🙏
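To make that concrete, here's a small sanity check, on made-up data rather than the SCF, showing that `inertia_` is the sum of those squared L2 distances from each datapoint to its assigned centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated blobs
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(pts)

# Sum of ||x - c||^2 over every datapoint, computed by hand
manual_inertia = sum(
    np.sum((x - km.cluster_centers_[label]) ** 2)
    for x, label in zip(pts, km.labels_)
)

print(np.isclose(manual_inertia, km.inertia_))  # True
```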
**Task 6.2.12:** Extract the inertia for your `model` and assign it to the variable `inertia`.
```python
inertia = model.inertia_
print("Inertia (3 clusters):", inertia)
```

```
Inertia (3 clusters): 939554010797059.4
```
The "best" inertia is 0, and our score is pretty far from that. Does that mean our model is "bad?" Not necessarily. Inertia is a measurement of distance (like mean absolute error from Project 2). This means that the unit of measurement for inertia depends on the unit of measurement of our x- and y-axes. And since `"DEBT"` and `"HOUSES"` are measured in tens of millions of dollars, it's not surprising that inertia is so large.
However, it would be helpful to have a metric that was easier to interpret, and that's where **silhouette score** comes in. Silhouette score measures the distance between different clusters. It ranges from -1 (the worst) to 1 (the best), so it's easier to interpret than inertia.
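As a rough illustration (on made-up toy data, not part of the lesson), the per-point silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in a point's own cluster and `b` is the mean distance to the points in the nearest other cluster:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny, well-separated 1D clusters
X_toy = np.array([[0.0], [1.0], [9.0], [10.0]])
labels_toy = np.array([0, 0, 1, 1])

# By hand, for the first point (x = 0.0):
a = 1.0               # only same-cluster neighbor is x = 1.0
b = (9.0 + 10.0) / 2  # mean distance to the other cluster
manual = (b - a) / max(a, b)

print(manual)
print(silhouette_samples(X_toy, labels_toy)[0])  # matches the manual value
```

`silhouette_score` is simply the mean of these per-point values over the whole dataset.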
```python
VimeoVideo("713918501", h="0462c4784a", width=600)
```

**Task 6.2.13:** Calculate the silhouette score for your model and assign it to the variable `ss`.
```python
ss = silhouette_score(X, model.labels_)
print("Silhouette Score (3 clusters):", ss)
```

```
Silhouette Score (3 clusters): 0.9768842462944348
```
Outstanding! 0.976 is pretty close to 1, so our model has done a good job at identifying 3 clusters that are far away from each other.
It's important to remember that these performance metrics are the result of the number of clusters we told our model to create. In unsupervised learning, the number of clusters is a hyperparameter that you set before training your model. So what would happen if we changed the number of clusters? Would it lead to better performance? Let's try!
```python
VimeoVideo("713918420", h="e16f3735c7", width=600)
```

**Task 6.2.14:** Use a `for` loop to build and train a K-Means model where `n_clusters` ranges from 2 to 12 (inclusive). Each time a model is trained, calculate the inertia and add it to the list `inertia_errors`, then calculate the silhouette score and add it to the list `silhouette_scores`.
xxxxxxxxxx- [Write a `for` loop in Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-for-Loops)xxxxxxxxxxn_clusters = range(2,13)inertia_errors = []silhouette_scores = []# Add `for` loop to train model and calculate inertia, silhouette score.for k in n_clusters: model= KMeans(n_clusters=k, random_state=42) model.fit(X) inertia_errors.append(model.inertia_) silhouette_scores.append(silhouette_score(X, model.labels_) ) print("Inertia:", inertia_errors)print()print("Silhouette Scores:", silhouette_scores)Inertia: [3018038313336857.5, 939554010797059.4, 546098841715646.25, 309310386410913.3, 235243397481784.3, 182225729179703.53, 150670779013790.4, 114321995931021.89, 100340259483919.02, 86229997033602.88, 74757234072100.36] Silhouette Scores: [0.9855099957519555, 0.9768842462944348, 0.9490311483406091, 0.839330043242819, 0.7287406719898627, 0.726989114305748, 0.7263840026889208, 0.7335125606476427, 0.692157992955073, 0.6949309528556856, 0.6951831031001252]
Now that we have both performance metrics for several different settings of `n_clusters`, let's make some line plots to see the relationship between the number of clusters in a model and its inertia and silhouette scores.
```python
VimeoVideo("713918224", h="32ff34ffa1", width=600)
```

**Task 6.2.15:** Create a line plot that shows the values of `inertia_errors` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Inertia"`, and use the title `"K-Means Model: Inertia vs Number of Clusters"`.
```python
# Plot `inertia_errors` by `n_clusters`
plt.plot(n_clusters, inertia_errors)
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("K-Means Model: Inertia vs Number of Clusters");
```
What we're seeing here is that, as the number of clusters increases, inertia goes down. In fact, we could get inertia to 0 if we told our model to make 4,623 clusters (the same number of observations in `X`), but those clusters wouldn't be helpful to us.
The trick with choosing the right number of clusters is to look for the "bend in the elbow" for this plot. In other words, we want to pick the point where the drop in inertia becomes less dramatic and the line begins to flatten out. In this case, it looks like the sweet spot is 4 or 5.
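The "bend" is usually picked by eye, but one rough heuristic can be sketched in code: keep increasing k while each step still removes a worthwhile share of inertia. The helper name and the threshold below are our assumptions for illustration, not part of the lesson:

```python
def elbow_k(ks, inertias, min_rel_drop=0.20):
    """Return the smallest k after which the relative drop in inertia
    falls below `min_rel_drop` (a judgment call, not a fixed rule)."""
    for i in range(1, len(inertias)):
        rel_drop = (inertias[i - 1] - inertias[i]) / inertias[i - 1]
        if rel_drop < min_rel_drop:
            return ks[i - 1]  # the previous k was the last "worthwhile" one
    return ks[-1]

# Made-up inertia curve that flattens out after k=4
ks = [2, 3, 4, 5, 6]
fake_inertias = [1000.0, 400.0, 200.0, 180.0, 170.0]
print(elbow_k(ks, fake_inertias))  # 4
```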
Let's see what the silhouette score looks like.
```python
VimeoVideo("713918153", h="3f3a1312d2", width=600)
```

**Task 6.2.16:** Create a line plot that shows the values of `silhouette_scores` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Silhouette Score"`, and use the title `"K-Means Model: Silhouette Score vs Number of Clusters"`.
```python
# Plot `silhouette_scores` vs `n_clusters`
plt.plot(n_clusters, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("K-Means Model: Silhouette Score vs Number of Clusters");
```
Note that, in contrast to our inertia plot, bigger is better. So we're not looking for a "bend in the elbow" but rather a number of clusters for which the silhouette score still remains high. We can see that silhouette score drops drastically beyond 4 clusters. Given this and what we saw in the inertia plot, it looks like the optimal number of clusters is 4.
Now that we've decided on the final number of clusters, let's build a final model.
```python
VimeoVideo("713918108", h="e6aa88569e", width=600)
```

**Task 6.2.17:** Build and train a new k-means model named `final_model`. Use the information you gained from the two plots above to set an appropriate value for the `n_clusters` argument. Once you've built and trained your model, submit it to the grader for evaluation.
```python
# Build model
final_model = KMeans(n_clusters=4, random_state=42)

# Fit model to data
final_model.fit(X)
```

```
KMeans(n_clusters=4, random_state=42)
```
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.2.17", final_model)
```

```
Excellent work.
Score: 1
```
(In case you're wondering, we don't need an *Evaluate* section in this notebook because we don't have any test data to evaluate our model with.)
# Communicate
```python
VimeoVideo("713918073", h="3929b58011", width=600)
```

**Task 6.2.18:** Create one last "Home Value vs. Household Debt" scatter plot that shows the clusters that your `final_model` has assigned to the training data.
```python
# Plot "HOUSES" vs "DEBT" with final_model labels
sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6, hue=final_model.labels_)
plt.xlabel("Household Debt [$1M]")
plt.ylabel("Home Value [$1M]")
plt.title("Credit Fearful: Home Value vs. Household Debt");
```

Nice! You can see all four of our clusters, each differentiated from the rest by color.
We're going to make one more visualization, converting the cluster analysis we just did to something a little more actionable: a side-by-side bar chart. In order to do that, we need to put our clustered data into a DataFrame.
```python
VimeoVideo("713918023", h="110156bd98", width=600)
```

**Task 6.2.19:** Create a DataFrame `xgb` that contains the mean `"DEBT"` and `"HOUSES"` values for each of the clusters in your `final_model`.
```python
xgb = X.groupby(final_model.labels_).mean()
xgb
```

| | HOUSES | DEBT |
|---|---|---|
| 0 | 1.031872e+05 | 8.488629e+04 |
| 1 | 3.448400e+07 | 1.838410e+07 |
| 2 | 1.407400e+07 | 5.472800e+06 |
| 3 | 4.551429e+06 | 2.420929e+06 |
```python
final_model.cluster_centers_
```

```
array([[  103187.22476563,    84886.28951384],
       [34484000.        , 18384100.        ],
       [14074000.        ,  5472800.        ],
       [ 4551428.57142857,  2420928.57142857]])
```

Before you move to the next task, print out the `cluster_centers_` for your `final_model`. Do you see any similarities between them and the DataFrame you just made? Why do you think that is?
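The similarity is no accident: each row of `cluster_centers_` is the mean of the features of the points in that cluster, which is exactly what the `groupby` computes. A quick check on made-up data (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two well-separated, made-up blobs in a DataFrame
rng = np.random.default_rng(1)
fake_X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))]),
    columns=["f1", "f2"],
)

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(fake_X)

# Per-cluster feature means, the same way we built `xgb`
group_means = fake_X.groupby(km.labels_).mean()

print(np.allclose(group_means.values, km.cluster_centers_))  # True
```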
```python
VimeoVideo("713917740", h="bcc496c2d9", width=600)
```

**Task 6.2.20:** Create a side-by-side bar chart from `xgb` that shows the mean `"DEBT"` and `"HOUSES"` values for each of the clusters in your `final_model`. For readability, you'll want to divide the values in `xgb` by 1 million. Be sure to label the x-axis `"Cluster"`, the y-axis `"Value [$1 million]"`, and use the title `"Mean Home Value & Household Debt by Cluster"`.
```python
# Create side-by-side bar chart of `xgb`, with values in $1 million
(xgb / 1e6).plot(kind="bar")
plt.xlabel("Cluster")
plt.ylabel("Value [$1 million]")
plt.title("Mean Home Value & Household Debt by Cluster");
```

```python
# Proportion of household debt to home value for each cluster
(xgb["DEBT"] / xgb["HOUSES"]).plot(kind="bar");
```
In this plot, we have our four clusters spread across the x-axis, and the dollar amounts for home value and household debt on the y-axis.
The first thing to look at in this chart is the different mean home values for the four clusters. Cluster 0 represents households with small to moderate home values, clusters 2 and 3 have high home values, and cluster 1 has extremely high values.
The second thing to look at is the proportion of debt to home value. In clusters 1 and 3, this proportion is around 0.5. This suggests that these groups have a moderate amount of untapped equity in their homes. But for group 0, it's over 0.8, which suggests that the largest source of household debt is their mortgage. Group 2 is unique in that they have the smallest proportion of debt to home value, around 0.4.
This information could be useful to financial institutions that want to target customers with products that would appeal to them. For instance, households in group 0 might be interested in refinancing their mortgage to lower their interest rate. Group 2 households could be interested in a home equity line of credit because they have more equity in their homes. And the bankers, Bill Gates, and Beyoncés in group 1 might want white-glove personalized wealth management.
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
**Usage Guidelines**
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
<font size="+3"><strong>6.3. Clustering with Multiple Features</strong></font>
In the previous lesson, we built a K-Means model to create clusters of respondents to the Survey of Consumer Finances. We made our clusters by looking at two features only, but there are hundreds of features in the dataset that we didn't take into account and that could contain valuable information. In this lesson, we'll examine all the features, selecting five to create clusters with. After we build our model and choose an appropriate number of clusters, we'll learn how to visualize multi-dimensional clusters in a 2D scatter plot using something called principal component analysis (PCA).
```python
import pandas as pd
import plotly.express as px
import wqet_grader
from IPython.display import VimeoVideo
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wqet_grader.init("Project 6 Assessment")
```

```python
VimeoVideo("714612789", h="f4f8c10683", width=600)
```

# Prepare Data
## Import
We spent some time in the last lesson zooming in on a useful subset of the SCF, and this time, we're going to zoom in even further. One of the persistent issues we've had with this dataset is that it includes some outliers in the form of ultra-wealthy households. This didn't make much of a difference for our last analysis, but it could pose a problem in this lesson, so we're going to focus on families with net worth under \$2 million.
```python
VimeoVideo("714612746", h="07dc57f72c", width=600)
```

**Task 6.3.1:** Rewrite your `wrangle` function from the last lesson so that it returns a DataFrame of households whose net worth is less than \$2 million and that have been turned down for credit or feared being denied credit in the past 5 years (see `"TURNFEAR"`).
```python
def wrangle(filepath):
    df = pd.read_csv(filepath)
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2e6)
    df = df[mask]
    return df
```

```python
df = wrangle("data/SCFP2019.csv.gz")
print(df.shape)
df.head()
```

```
(4418, 351)
```
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
| 7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
5 rows × 351 columns
## Explore
In this lesson, we want to make clusters using more than two features, but which of the 351 features should we choose? Oftentimes, this decision will be made for you. For example, a stakeholder could give you a list of the features that are most important to them. If you don't have that limitation, though, another way to choose the best features for clustering is to determine which numerical features have the largest **variance**. That's what we'll do here.
VimeoVideo("714612679", h="040facf6e2", width=600)

**Task 6.3.2:** Calculate the variance for all the features in `df`, and create a Series `top_ten_var` with the 10 features with the largest variance.
# Calculate variance, get 10 features with the largest variance
top_ten_var = df.var().sort_values().tail(10)
top_ten_var

PLOAN1      1.140894e+10
ACTBUS      1.251892e+10
BUS         1.256643e+10
KGTOTAL     1.346475e+10
DEBT        1.848252e+10
NHNFIN      2.254163e+10
HOUSES      2.388459e+10
NETWORTH    4.847029e+10
NFIN        5.713939e+10
ASSET       8.303967e+10
dtype: float64
As usual, it's harder to make sense of a list like this than it would be if we visualized it, so let's make a graph.
VimeoVideo("714612647", h="5ecf36a0db", width=600)

**Task 6.3.3:** Use plotly express to create a horizontal bar chart of `top_ten_var`. Be sure to label your x-axis `"Variance"`, the y-axis `"Feature"`, and use the title `"SCF: High Variance Features"`.
# Create horizontal bar chart of `top_ten_var`
fig = px.bar(x=top_ten_var, y=top_ten_var.index, title="SCF: High Variance Features")
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")
fig.show()

One thing that we've seen throughout this project is that many of the wealth indicators are highly skewed, with a few outlier households having enormous wealth. Those outliers can affect our measure of variance. Let's see if that's the case with one of the features from `top_ten_var`.
VimeoVideo("714612615", h="9ae23890fc", width=600)

**Task 6.3.4:** Use plotly express to create a horizontal boxplot of `"NHNFIN"` to determine if the values are skewed. Be sure to label the x-axis `"Value [$]"`, and use the title `"Distribution of Non-home, Non-Financial Assets"`.
# Create a boxplot of `NHNFIN`
fig = px.box(data_frame=df, x="NHNFIN", title="Distribution of Non-home, Non-Financial Assets")
fig.update_layout(xaxis_title="Value [$]")
fig.show()

Whoa! The dataset is massively right-skewed because of the huge outliers on the right side of the distribution. Even though we already excluded households with a high net worth with our `wrangle` function, the variance is still being distorted by some extreme outliers.
The best way to deal with this is to look at the **trimmed variance**, where we remove extreme values before calculating the variance. We can do this using the `trimmed_var` function from the SciPy library.
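To see why trimming helps, here's a minimal sketch with hypothetical numbers (not SCF values): one extreme outlier inflates the ordinary variance, while `trimmed_var` discards the top and bottom 10% of observations before computing it.

```python
# Hypothetical values with one huge outlier (not SCF data)
import numpy as np
from scipy.stats.mstats import trimmed_var

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1_000])

full_var = values.var()                            # dominated by the outlier
trim_var = trimmed_var(values, limits=(0.1, 0.1))  # drops top/bottom 10% first

print(full_var > trim_var)  # True: trimming shrinks the variance dramatically
```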
VimeoVideo("714612570", h="b1be8fb750", width=600)

**Task 6.3.5:** Calculate the trimmed variance for the features in `df`. Your calculations should not include the top and bottom 10% of observations. Then create a Series `top_ten_trim_var` with the 10 features with the largest variance.
# Calculate trimmed variance
top_ten_trim_var = df.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
top_ten_trim_var

WAGEINC     5.550737e+08
HOMEEQ      7.338377e+08
NH_MORT     1.333125e+09
MRTHEL      1.380468e+09
PLOAN1      1.441968e+09
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
Okay! Now that we've got a better set of numbers, let's make another bar graph.
VimeoVideo("714611188", h="d762a98b1e", width=600)

**Task 6.3.6:** Use plotly express to create a horizontal bar chart of `top_ten_trim_var`. Be sure to label your x-axis `"Trimmed Variance"`, the y-axis `"Feature"`, and use the title `"SCF: High Variance Features"`.
# Create horizontal bar chart of `top_ten_trim_var`
fig = px.bar(x=top_ten_trim_var, y=top_ten_trim_var.index, title="SCF: High Variance Features")
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature")
fig.show()

There are three things to notice in this plot. First, the variances have decreased a lot. In our previous chart, the x-axis went up to \$80 billion; this one goes up to \$12 billion. Second, the top 10 features have changed a bit. All the features relating to business ownership (`"...BUS"`) are gone. Finally, we can see that there are big differences in variance from feature to feature. For example, the variance for `"WAGEINC"` is around \$500 million, while the variance for `"ASSET"` is nearly \$12 billion. In other words, these features have completely different scales. This is something that we'll need to address before we can make good clusters.
VimeoVideo("714611161", h="61dee490ee", width=600)

**Task 6.3.7:** Generate a list `high_var_cols` with the column names of the five features with the highest trimmed variance.
high_var_cols = top_ten_trim_var.tail(5).index.tolist()
high_var_cols

['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']
## 1.3. Split
Now that we've gotten our data to a place where we can use it, we can follow the steps we've used before to build a model, starting with a feature matrix.
VimeoVideo("714611148", h="f7fbd4bcc5", width=600)

**Task 6.3.8:** Create the feature matrix `X`. It should contain the five columns in `high_var_cols`.
X = df[high_var_cols]
print("X shape:", X.shape)
X.head()

X shape: (4418, 5)
|  | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 5 | 12200.0 | -6710.0 | 0.0 | 3900.0 | 5490.0 |
| 6 | 12600.0 | -4710.0 | 0.0 | 6300.0 | 7890.0 |
| 7 | 15300.0 | -8115.0 | 0.0 | 5600.0 | 7185.0 |
| 8 | 14100.0 | -2510.0 | 0.0 | 10000.0 | 11590.0 |
| 9 | 15400.0 | -5715.0 | 0.0 | 8100.0 | 9685.0 |
# 2. Build Model
## 2.1. Iterate
During our EDA, we saw that we had a scale issue among our features. That issue can make it harder to cluster the data, so we'll need to fix that to help our analysis along. One strategy we can use is **standardization**, a statistical method for putting all the variables in a dataset on the same scale. Let's explore how that works here. Later, we'll incorporate it into our model pipeline.
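Here's a minimal sketch of what standardization does, using made-up numbers: `StandardScaler` just applies the z-score formula, subtracting each column's mean and dividing by its standard deviation.

```python
# Made-up two-column matrix (not SCF data)
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])

# By hand: z = (x - mean) / std, column by column
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# With scikit-learn (StandardScaler also uses the population std, ddof=0)
z_sklearn = StandardScaler().fit_transform(X_demo)

print(np.allclose(z_manual, z_sklearn))  # True
```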
VimeoVideo("714611113", h="3671a603b5", width=600)

**Task 6.3.9:** Create a DataFrame `X_summary` with the mean and standard deviation for all the features in `X`.
X_summary = X.aggregate(["mean", "std"])
X_summary

|  | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| mean | 72701.258488 | 76387.768900 | 74530.805794 | 117330.637166 | 149089.027388 |
| std | 135950.435529 | 220159.684405 | 154546.415791 | 239038.471726 | 288166.040553 |
That's the information we need to standardize our data, so let's make it happen.
VimeoVideo("714611056", h="670f6bdb78", width=600)

**Task 6.3.10:** Create a `StandardScaler` transformer, use it to transform the data in `X`, and then put the transformed data into a DataFrame named `X_scaled`.
- What's standardization?
- Transform data using a transformer in scikit-learn.
# Instantiate transformer
ss = StandardScaler()

# Transform `X`
X_scaled_data = ss.fit_transform(X)

# Put `X_scaled_data` into DataFrame
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)

print("X_scaled shape:", X_scaled.shape)
X_scaled.head()

X_scaled shape: (4418, 5)
|  | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 0 | -0.445075 | -0.377486 | -0.48231 | -0.474583 | -0.498377 |
| 1 | -0.442132 | -0.368401 | -0.48231 | -0.464541 | -0.490047 |
| 2 | -0.422270 | -0.383868 | -0.48231 | -0.467470 | -0.492494 |
| 3 | -0.431097 | -0.358407 | -0.48231 | -0.449061 | -0.477206 |
| 4 | -0.421534 | -0.372966 | -0.48231 | -0.457010 | -0.483818 |
As you can see, all five of the features use the same scale now. But just to make sure, let's take a look at their mean and standard deviation.
VimeoVideo("714611032", h="1ed03c46eb", width=600)

**Task 6.3.11:** Create a DataFrame `X_scaled_summary` with the mean and standard deviation for all the features in `X_scaled`.
X_scaled_summary = X_scaled.aggregate(["mean", "std"]).astype(int)
X_scaled_summary

|  | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| mean | 0 | 0 | 0 | 0 | 0 |
| std | 1 | 1 | 1 | 1 | 1 |
And that's what it should look like. Remember, standardization takes all the features and scales them so that they all have a mean of 0 and a standard deviation of 1.
Now that we can compare all our data on the same scale, we can start making clusters. Just like we did last time, we need to figure out how many clusters we should have.
VimeoVideo("714610976", h="82f32af967", width=600)

**Task 6.3.12:** Use a `for` loop to build and train a K-Means model where `n_clusters` ranges from 2 to 12 (inclusive). Your model should include a `StandardScaler`. Each time a model is trained, calculate the inertia and add it to the list `inertia_errors`, then calculate the silhouette score and add it to the list `silhouette_scores`.
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, random_state=42)
    )
    model.fit(X)
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    silhouette_scores.append(
        silhouette_score(X, model.named_steps["kmeans"].labels_)
    )

print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])

Inertia: [11028.058082607145, 7190.526303575355, 5924.997726868041]

Silhouette Scores: [0.7464502937083215, 0.7044601307791996, 0.6962653079183132]
Just like last time, let's create an elbow plot to see how many clusters we should use.
VimeoVideo("714610940", h="bacf42a282", width=600)

**Task 6.3.13:** Use plotly express to create a line plot that shows the values of `inertia_errors` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Inertia"`, and use the title `"K-Means Model: Inertia vs Number of Clusters"`.
# Create line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Inertia")
fig.show()

You can see that the line starts to flatten out around 4 or 5 clusters.
**Note:** We ended up using 5 clusters last time, too, but that's because we're working with very similar data. 5 clusters isn't always going to be the right choice for this type of analysis, as we'll see below.
Let's make another line plot based on the silhouette scores.
VimeoVideo("714610912", h="01961ee57a", width=600)

**Task 6.3.14:** Use plotly express to create a line plot that shows the values of `silhouette_scores` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Silhouette Score"`, and use the title `"K-Means Model: Silhouette Score vs Number of Clusters"`.
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Silhouette Score")
fig.show()

This one's a little less straightforward, but we can see that the best silhouette scores occur when there are 3 or 4 clusters.
Putting the information from this plot together with our inertia plot, it seems like the best setting for `n_clusters` will be 4.
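If you wanted to make that judgment call programmatically, one possible heuristic is sketched below. The scores and the 0.93 tolerance are illustrative choices, not values from our actual run: the silhouette winner alone would pick k=2, so we instead keep the largest k whose score stays within the tolerance of the best, which lines up better with the elbow plot.

```python
# Illustrative silhouette scores by k (not the values computed above)
sil = {2: 0.746, 3: 0.704, 4: 0.696, 5: 0.641}

# Highest silhouette alone would pick the smallest k...
best_by_sil = max(sil, key=sil.get)

# ...so keep every k whose score is within a tolerance of the best,
# then choose the largest of those (an arbitrary-but-simple compromise)
tol = 0.93
candidates = [k for k, s in sil.items() if s >= tol * sil[best_by_sil]]
chosen_k = max(candidates)

print(best_by_sil, chosen_k)  # 2 4
```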
VimeoVideo("714610883", h="a6a0431b02", width=600)

**Task 6.3.15:** Build and train a new k-means model named `final_model`. Use the information you gained from the two plots above to set an appropriate value for the `n_clusters` argument. Once you've built and trained your model, submit it to the grader for evaluation.
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=42)
)
final_model.fit(X)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4, random_state=42))])
When you're confident in your model, submit it to the grader.
wqet_grader.grade("Project 6 Assessment", "Task 6.3.14", final_model)

Python master 😁
Score: 1
# 3. Communicate
It's time to let everyone know how things turned out. Let's start by grabbing the labels.
VimeoVideo("714610862", h="69ff3fb2c8", width=600)

**Task 6.3.16:** Extract the labels that your `final_model` created during training and assign them to the variable `labels`.
labels = final_model.named_steps["kmeans"].labels_
print(labels[:5])

[0 0 0 0 0]
We're going to make a visualization, so we need to create a new DataFrame to work with.
VimeoVideo("714610842", h="008a463aca", width=600)

**Task 6.3.17:** Create a DataFrame `xgb` that contains the mean values of the features in `X` for each of the clusters in your `final_model`.
xgb = X.groupby(labels).mean()
xgb

|  | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 0 | 26551.075439 | 13676.153182 | 13745.637777 | 2.722605e+04 | 4.022723e+04 |
| 1 | 218112.818182 | 174713.441558 | 257403.246753 | 3.305884e+05 | 3.928263e+05 |
| 2 | 116160.779817 | 965764.155963 | 264339.449541 | 7.800611e+05 | 1.081925e+06 |
| 3 | 732937.575758 | 760397.575758 | 826136.363636 | 1.276227e+06 | 1.493335e+06 |
Now that we have a DataFrame, let's make a bar chart and see how our clusters differ.
VimeoVideo("714610772", h="e118407ff1", width=600)

**Task 6.3.18:** Use plotly express to create a side-by-side bar chart from `xgb` that shows the mean of the features in `X` for each of the clusters in your `final_model`. Be sure to label the x-axis `"Cluster"`, the y-axis `"Value [$]"`, and use the title `"Mean Household Finances by Cluster"`.
# Create side-by-side bar chart of `xgb`
fig = px.bar(
    xgb,
    barmode="group",
    title="Mean Household Finances by Cluster"
)
fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
fig.show()

Remember that our clusters are based partially on `NETWORTH`, which means that the households in cluster 0 have the smallest net worth, and the households in cluster 2 have the highest. Based on that, there are some interesting things to unpack here.
First, take a look at the `DEBT` variable. You might think that it would scale as net worth increases, but it doesn't. Of the three wealthier clusters, the least debt is carried by the households in cluster 2, even though the value of their houses (shown in green) is roughly the same as cluster 1's. You can't really tell from this data what's going on, but one possibility might be that the people in cluster 2 have enough money to pay down their debts, but not quite enough money to leverage what they have into additional debts. The people in cluster 3, by contrast, might not need to worry about carrying debt because their net worth is so high.
Finally, since we started out this project looking at home values, take a look at the relationship between `DEBT` and `HOUSES`. The value of the debt for the people in cluster 0 is higher than the value of their houses, suggesting that most of the debt being carried by those people is tied up in their mortgages — if they own a home at all. Contrast that with the other three clusters: the value of everyone else's debt is lower than the value of their homes.
So all that's pretty interesting, but it's different from what we did last time, right? At this point in the last lesson, we made a scatter plot. This was a straightforward task because we only worked with two features, so we could plot the data points in two dimensions. But now X has five dimensions! How can we plot this to give stakeholders a sense of our clusters?
Since we're working with a computer screen, we don't have much of a choice about the number of dimensions we can use: it's got to be two. So, if we're going to do anything like the scatter plot we made before, we'll need to take our 5-dimensional data and change it into something we can look at in 2 dimensions.
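Here's a minimal sketch of that reduction on synthetic data (generated for illustration): five columns built from two underlying factors collapse to two principal components with essentially no information lost.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic 100 x 5 matrix whose columns mix just two hidden factors
factors = rng.normal(size=(100, 2))
X_demo = factors @ rng.normal(size=(2, 5))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_demo)

print(X_2d.shape)  # (100, 2)
print(pca.explained_variance_ratio_.sum() > 0.99)  # two components suffice here
```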
VimeoVideo("714610665", h="19c9f7bf7f", width=600)

**Task 6.3.19:** Create a `PCA` transformer, use it to reduce the dimensionality of the data in `X` to 2, and then put the transformed data into a DataFrame named `X_pca`. The columns of `X_pca` should be named `"PC1"` and `"PC2"`.
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Transform `X`
X_t = pca.fit_transform(X)

# Put `X_t` into DataFrame
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])

print("X_pca shape:", X_pca.shape)
X_pca.head()

X_pca shape: (4418, 2)
|  | PC1 | PC2 |
|---|---|---|
| 0 | -221525.424530 | -22052.273003 |
| 1 | -217775.100722 | -22851.358068 |
| 2 | -219519.642175 | -19023.646333 |
| 3 | -212195.720367 | -22957.107039 |
| 4 | -215540.507551 | -20259.749306 |
So there we go: our five dimensions have been reduced to two. Let's make a scatter plot and see what we get.
VimeoVideo("714610491", h="755c66fe15", width=600)

**Task 6.3.20:** Use plotly express to create a scatter plot of `X_pca`. Be sure to color the data points using the labels generated by your `final_model`. Label the x-axis `"PC1"`, the y-axis `"PC2"`, and use the title `"PCA Representation of Clusters"`.
# Create scatter plot of `PC2` vs `PC1`
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels,
    title="PCA Representation of Clusters"
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
fig.show()

**Note:** Try adding a `StandardScaler` to your transformation of `X`. How does it change the clusters in your scatter plot?
One limitation of this plot is that it's hard to explain what the axes here represent. In fact, both of them are a combination of the five features we originally had in `X`, which means this is pretty abstract. Still, it's the best way we have to show as much information as possible as an explanatory tool for people outside the data science community.
So what does this graph mean? It means that we made four tightly-grouped clusters that share some key features. If we were presenting this to a group of stakeholders, it might be useful to show this graph first as a kind of warm-up, since most people understand how a two-dimensional object works. Then we could move on to a more nuanced analysis of the data.
Just something to keep in mind as you continue your data science journey.
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
<font size="+3"><strong>6.4. Interactive Dashboard</strong></font>
In the last lesson, we built a model based on the highest-variance features in our dataset and created several visualizations to communicate our results. In this lesson, we're going to combine all of these elements into a dynamic web application that will allow users to choose their own features, build a model, and evaluate its performance through a graphic user interface. In other words, you'll create a tool that will allow anyone to build a model without code.
**Warning:** If you have issues with your app launching during this project, try restarting your kernel and re-running the notebook from the beginning. Go to **Kernel > Restart Kernel and Clear All Outputs**.

If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the "Overview" section of the WQU learning platform.
import pandas as pd
import plotly.express as px
import wqet_grader
from dash import Input, Output, dcc, html
from IPython.display import VimeoVideo
from jupyter_dash import JupyterDash
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wqet_grader.init("Project 6 Assessment")

JupyterDash.infer_jupyter_proxy_config()

VimeoVideo("715724401", h="062cb7d8cb", width=600)

# 1. Prepare Data
As always, we'll start by bringing our data into the project using a `wrangle` function.
## 1.1. Import
VimeoVideo("715724313", h="711e785135", width=600)

**Task 6.4.1:** Complete the `wrangle` function below, using the docstring as a guide. Then use your function to read the file `"data/SCFP2019.csv.gz"` into a DataFrame.
def wrangle(filepath):
    """Read SCF data file into ``DataFrame``.

    Returns only credit fearful households whose net worth
    is less than $2 million.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    df = pd.read_csv(filepath)
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2e6)
    df = df[mask]
    return df

df = wrangle("data/SCFP2019.csv.gz")
print(df.shape)
df.head()

(4418, 351)
|  | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
| 7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
5 rows × 351 columns
# 2. Build Dashboard
It's app time! There are lots of steps to follow here, but, by the end, you'll have made an interactive dashboard! We'll start with the layout.
## 2.1. Application Layout
First, instantiate the application.
VimeoVideo("715724244", h="41e32f352f", width=600)

**Task 6.4.2:** Instantiate a `JupyterDash` application and assign it to the variable name `app`.
app = JupyterDash(__name__)

Then, let's give the app some labels.
VimeoVideo("715724173", h="21f2757631", width=600)

**Task 6.4.3:** Start building the layout of your `app` by creating a `Div` object that has two child objects: an `H1` header that reads `"Survey of Consumer Finances"` and an `H2` header that reads `"High Variance Features"`.
```python
app.layout = html.Div(
    [
        # Application title
        html.H1("Survey of Consumer Finances"),
        # Bar chart section
        html.H2("High Variance Features"),
        dcc.Graph(id="bar-chart"),
        dcc.RadioItems(
            options=[
                {"label": "trimmed", "value": True},
                {"label": "not trimmed", "value": False},
            ],
            value=True,
            id="trim-button",
        ),
        # K-means section
        html.H2("K-means Clustering"),
        html.H3("Number of Clusters (k)"),
        dcc.Slider(min=2, max=12, step=1, value=2, id="k-slider"),
        html.Div(id="metrics"),
        # PCA scatter plot
        dcc.Graph(id="pca-scatter"),
    ]
)
```

Eventually, the app we make will have several interactive parts. We'll start with a bar chart.
## Variance Bar Chart
No matter how well-designed the chart might be, it won't show up in the app unless we add it to the dashboard as an object first.
```python
VimeoVideo("715724086", h="e9ed963958", width=600)
```

**Task 6.4.4:** Add a `Graph` object to your application's layout. Be sure to give it the id `"bar-chart"`.
Just like we did last time, we need to retrieve the features with the highest variance.
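As a refresher, trimmed variance discards the extreme tails before computing the variance, which keeps a handful of very wealthy households from dominating the result. Here's a minimal sketch on made-up numbers (not the SCF data), using `scipy.stats.mstats.trimmed_var`, which is presumably the helper imported in this notebook:

```python
import pandas as pd
from scipy.stats.mstats import trimmed_var

# Toy data (made up): one extreme value dominates the ordinary variance
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1_000])

plain = s.var()  # inflated by the outlier
trimmed = float(trimmed_var(s, limits=(0.1, 0.1)))  # drop bottom/top 10% first

print("plain:", plain, "trimmed:", trimmed)
```

The trimmed value reflects the spread of the typical observations, while the plain variance is driven almost entirely by the single outlier.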
```python
VimeoVideo("715724816", h="80ec24d3d6", width=600)
```

**Task 6.4.5:** Create a `get_high_var_features` function that returns the five highest-variance features in a DataFrame. Use the docstring for guidance.
```python
def get_high_var_features(trimmed=True, return_feat_names=False):
    """Returns the five highest-variance features of ``df``.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    return_feat_names : bool, default=False
        If ``True``, returns feature names as a ``list``. If ``False``, returns
        a ``Series``, where the index is feature names and the values are
        variances.
    """
    # Calculate variance (trimmed or untrimmed)
    if trimmed:
        top_five_features = df.apply(trimmed_var).sort_values().tail(5)
    else:
        top_five_features = df.var().sort_values().tail(5)
    # Optionally extract feature names only
    if return_feat_names:
        top_five_features = top_five_features.index.tolist()
    return top_five_features
```

```python
get_high_var_features()
```

Now that we have our top five features, we can use a function to return them in a bar chart.
```python
VimeoVideo("715724735", h="5238a5c518", width=600)
```

**Task 6.4.6:** Create a `serve_bar_chart` function that returns a plotly express bar chart of the five highest-variance features. You should use `get_high_var_features` as a helper function. Follow the docstring for guidance.
```python
@app.callback(
    Output("bar-chart", "figure"),
    Input("trim-button", "value"),
)
def serve_bar_chart(trimmed=True):
    """Returns a horizontal bar chart of five highest-variance features.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    """
    # Get five highest-variance features as a Series
    top_five_features = get_high_var_features(trimmed=trimmed, return_feat_names=False)
    # Build horizontal bar chart
    fig = px.bar(x=top_five_features, y=top_five_features.index, orientation="h")
    fig.update_layout(xaxis_title="Variance", yaxis_title="Features")
    return fig
```

```python
serve_bar_chart()
```

Now, add the actual chart to the app.
```python
VimeoVideo("715724706", h="b672dd9202", width=600)
```

**Task 6.4.7:** Use your `serve_bar_chart` function to add a bar chart to `"bar-chart"`.
What we've done so far hasn't been all that different from other visualizations we've built in the past. Most of those charts have been static, but this one's going to be interactive. Let's add a radio button to give people something to play with.
```python
VimeoVideo("715724662", h="957a128506", width=600)
```

**Task 6.4.8:** Add a radio button to your application's layout. It should have two options: `"trimmed"` (which carries the value `True`) and `"not trimmed"` (which carries the value `False`). Be sure to give it the id `"trim-button"`.
Now that we have code to create our bar chart, a place in our app to put it, and a button to manipulate it, let's connect all three elements.
```python
VimeoVideo("715724573", h="7de7932f70", width=600)
```

**Task 6.4.9:** Add a callback decorator to your `serve_bar_chart` function. The callback input should be the value returned by `"trim-button"`, and the output should be directed to `"bar-chart"`.
When you're satisfied with your bar chart and radio buttons, scroll down to the bottom of this page and run the last block of code to see your work in action!
## K-means Slider and Metrics
Okay, so now our app has a radio button, but that's only one thing for a viewer to interact with. Buttons are fun, but what if we made a slider to help people see what it means for the number of clusters to change? Let's do it!
Again, start by adding some objects to the layout.
```python
VimeoVideo("715725482", h="88aa75b1e2", width=600)
```

**Task 6.4.10:** Add two text objects to your application's layout: an `H2` header that reads `"K-means Clustering"` and an `H3` header that reads `"Number of Clusters (k)"`.
Now add the slider.
```python
VimeoVideo("715725430", h="5d24607b0c", width=600)
```

**Task 6.4.11:** Add a slider to your application's layout. It should range from `2` to `12`. Be sure to give it the id `"k-slider"`.
And add the whole thing to the app.
```python
VimeoVideo("715725405", h="8944b9c674", width=600)
```

**Task 6.4.12:** Add a `Div` object to your application's layout. Be sure to give it the id `"metrics"`.
So now we have a bar chart that changes with a radio button, and a slider that changes... well, nothing yet. Let's give it a model to work with.
```python
VimeoVideo("715725235", h="55229ebf88", width=600)
```

**Task 6.4.13:** Create a `get_model_metrics` function that builds, trains, and evaluates a `KMeans` model. Use the docstring for guidance. Note that, like the model you made in the last lesson, your model here should be a pipeline that includes a `StandardScaler`. Once you're done, submit your function to the grader.
```python
def get_model_metrics(trimmed=True, k=2, return_metrics=False):
    """Build ``KMeans`` model based on five highest-variance features in ``df``.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    k : int, default=2
        Number of clusters.
    return_metrics : bool, default=False
        If ``False``, returns the ``KMeans`` model. If ``True``, returns a
        ``dict`` with the model's inertia and silhouette score.
    """
    # Get high-variance feature names
    features = get_high_var_features(trimmed=trimmed, return_feat_names=True)
    # Create feature matrix
    X = df[features]
    # Build and train model
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    model.fit(X)
    if return_metrics:
        # Calculate inertia
        i = model.named_steps["kmeans"].inertia_
        # Calculate silhouette score
        ss = silhouette_score(X, model.named_steps["kmeans"].labels_)
        # Put results into a dictionary
        metrics = {"inertia": round(i), "silhouette": round(ss, 3)}
        return metrics
    return model
```

```python
wqet_grader.grade("Project 6 Assessment", "Task 6.4.13", get_model_metrics())
```

Good work!
Score: 1
Part of what we want people to be able to do with the dashboard is see how the model's inertia and silhouette score change when they move the slider around, so let's calculate those numbers...
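As a quick reminder of what those two numbers measure, here's a standalone sketch on made-up blob data (not the SCF extract): inertia is the within-cluster sum of squared distances (lower means tighter clusters), and the silhouette score, between -1 and 1, measures how well separated the clusters are.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two well-separated toy blobs (made-up data)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Same pipeline shape as the lesson: scale, then cluster
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, random_state=42))
model.fit(X_toy)

# Inertia: within-cluster sum of squared distances
inertia = model.named_steps["kmeans"].inertia_
# Silhouette: separation quality in [-1, 1]
ss = silhouette_score(X_toy, model.named_steps["kmeans"].labels_)
print("Inertia:", round(inertia), "Silhouette:", round(ss, 3))
```

Because the two blobs barely overlap, the silhouette score here comes out close to 1; on real survey data the score will be lower and will change as `k` moves.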
```python
VimeoVideo("715725137", h="124312b155", width=600)
```

**Task 6.4.14:** Create a `serve_metrics` function. It should use your `get_model_metrics` to build and get the metrics for a model, and then return two objects: an `H3` header with the model's inertia and another `H3` header with the silhouette score.
```python
@app.callback(
    Output("metrics", "children"),
    Input("trim-button", "value"),
    Input("k-slider", "value"),
)
def serve_metrics(trimmed=True, k=2):
    """Returns list of ``H3`` elements containing inertia and silhouette score
    for ``KMeans`` model.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    k : int, default=2
        Number of clusters.
    """
    # Get metrics
    metrics = get_model_metrics(trimmed=trimmed, k=k, return_metrics=True)
    # Put metrics into HTML elements
    text = [
        html.H3(f"Inertia: {metrics['inertia']}"),
        html.H3(f"Silhouette Score: {metrics['silhouette']}"),
    ]
    return text
```

... and add them to the app.
```python
VimeoVideo("715726075", h="ee0510063c", width=600)
```

```python
serve_metrics()
```

**Task 6.4.15:** Add a callback decorator to your `serve_metrics` function. The callback inputs should be the values returned by `"trim-button"` and `"k-slider"`, and the output should be directed to `"metrics"`.
## PCA Scatter Plot
We just made a slider that can change the inertia and silhouette scores, but not everyone will be able to understand what those changing numbers mean. Let's make a scatter plot to help them along.
```python
VimeoVideo("715726033", h="a658095771", width=600)
```

**Task 6.4.16:** Add a `Graph` object to your application's layout. Be sure to give it the id `"pca-scatter"`.
Just like with the bar chart, we need to get the five highest-variance features of the data, so let's start with that.
```python
VimeoVideo("715725930", h="f957d27741", width=600)
```

**Task 6.4.17:** Create a function `get_pca_labels` that subsets a DataFrame to its five highest-variance features, reduces those features to two dimensions using `PCA`, and returns a new DataFrame with three columns: `"PC1"`, `"PC2"`, and `"labels"`. This last column should be the labels determined by a `KMeans` model. Your function should use `get_high_var_features` and `get_model_metrics` as helpers. Refer to the docstring for guidance.
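Before wiring this into the app, the PCA step on its own can be sketched with toy data (the feature names below are hypothetical stand-ins, not SCF columns): five features go in, two principal components come out.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the five high-variance columns
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 5)), columns=["f1", "f2", "f3", "f4", "f5"])

# Reduce five features to two principal components
pca = PCA(n_components=2, random_state=42)
X_pca = pd.DataFrame(pca.fit_transform(X_toy), columns=["PC1", "PC2"])
print(X_pca.shape)  # (100, 2)
```

The two components are the directions of greatest variance, which is why a 2D scatter plot of them is a reasonable summary of a five-dimensional feature matrix.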
```python
def get_pca_labels(trimmed=True, k=2):
    """Reduce the five highest-variance features of ``df`` to two dimensions
    with ``PCA`` and attach ``KMeans`` labels.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    k : int, default=2
        Number of clusters.
    """
    # Subset data to the five highest-variance features
    features = get_high_var_features(trimmed=trimmed, return_feat_names=True)
    X = df[features]
    # Build transformer and reduce data to two dimensions
    transformer = PCA(n_components=2, random_state=42)
    X_t = transformer.fit_transform(X)
    X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
    # Add labels from the trained KMeans model
    model = get_model_metrics(trimmed=trimmed, k=k, return_metrics=False)
    X_pca["labels"] = model.named_steps["kmeans"].labels_.astype(str)
    X_pca.sort_values("labels", inplace=True)
    return X_pca
```

```python
get_pca_labels()
```

| | PC1 | PC2 | labels |
|---|---|---|---|
| 2208 | 889749.557584 | 467355.407904 | 0 |
| 1056 | 649765.113978 | 174994.130637 | 0 |
| 1057 | 649536.017166 | 176269.044416 | 0 |
| 1058 | 649536.017166 | 176269.044416 | 0 |
| 1059 | 649765.113978 | 174994.130637 | 0 |
| ... | ... | ... | ... |
| 1570 | -229796.419844 | -14301.836873 | 1 |
| 1571 | -229805.583716 | -14250.840322 | 1 |
| 1572 | -229814.747589 | -14199.843771 | 1 |
| 1611 | -213724.571420 | -39060.460885 | 1 |
| 4417 | 334191.956229 | -186450.064242 | 1 |
4418 rows × 3 columns
Now we can use those five features to make the actual scatter plot.
```python
VimeoVideo("715725877", h="21365c862f", width=600)
```

**Task 6.4.18:** Create a function `serve_scatter_plot` that creates a 2D scatter plot of the data used to train a `KMeans` model, along with color-coded clusters. Use `get_pca_labels` as a helper. Refer to the docstring for guidance.
```python
@app.callback(
    Output("pca-scatter", "figure"),
    Input("trim-button", "value"),
    Input("k-slider", "value"),
)
def serve_scatter_plot(trimmed=True, k=2):
    """Build 2D scatter plot of ``df`` with ``KMeans`` labels.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    k : int, default=2
        Number of clusters.
    """
    fig = px.scatter(
        data_frame=get_pca_labels(trimmed=trimmed, k=k),
        x="PC1",
        y="PC2",
        color="labels",
        title="PCA Representation of Clusters",
    )
    fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
    return fig
```

Again, we finish up by adding some code to make the interactive elements of our app actually work.
```python
VimeoVideo("715725777", h="4b3ecacb85", width=600)
```

**Task 6.4.19:** Add a callback decorator to your `serve_scatter_plot` function. The callback inputs should be the values returned by `"trim-button"` and `"k-slider"`, and the output should be directed to `"pca-scatter"`.
## Application Deployment
Once you're feeling good about all the work we just did, run the cell and watch the app come to life!
**Task 6.4.20:** Run the cell below to deploy your application. 😎
<div class="alert alert-block alert-warning">
<b>Warning:</b> If you have issues with your app launching during this project, try restarting your kernel and re-running the notebook from the beginning. Go to <b>Kernel > Restart Kernel and Clear All Outputs</b>. If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the "Overview" section of the WQU learning platform.
</div>
```python
app.run_server(host="0.0.0.0", mode="external")
```

Dash app running on https://vm.wqu.edu/proxy/8050/
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
**Usage Guidelines**
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
<font size="+3"><strong>6.5. Small Business Owners in the United States 🇺🇸</strong></font>
In this assignment, you're going to focus on business owners in the United States. You'll start by examining some demographic characteristics of the group, such as age, income category, and debt vs. home value. Then you'll select high-variance features, and create a clustering model to divide small business owners into subgroups. Finally, you'll create some visualizations to highlight the differences between these subgroups. Good luck! 🍀
```python
from sklearn.preprocessing import StandardScaler

wqet_grader.init("Project 6 Assessment")
```

# Prepare Data
## Import
Let's start by bringing our data into the assignment.
**Task 6.5.1:** Read the file `"data/SCFP2019.csv.gz"` into the DataFrame `df`.
```python
df = pd.read_csv("data/SCFP2019.csv.gz")
print("df shape:", df.shape)
```

df shape: (28885, 351)
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 11 | 6119.779308 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 2 | 10 | 6 | 6 | 3 | 3 |
| 1 | 1 | 12 | 4712.374912 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
| 2 | 1 | 13 | 5145.224455 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
| 3 | 1 | 14 | 5297.663412 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 2 | 6 | 2 | 1 | 10 | 4 | 4 | 2 | 2 |
| 4 | 1 | 15 | 4761.812371 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
5 rows × 351 columns
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.1", list(df.shape))
```

You got it. Dance party time! 🕺💃🕺💃
Score: 1
## Explore
As mentioned at the start of this assignment, you're focusing on business owners. But what percentage of the respondents in `df` are business owners?
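One way to get that percentage: comparing the ownership flag to `1` yields a Boolean Series, and the mean of a Boolean Series is the proportion of `True` values. A tiny sketch with a hypothetical stand-in column:

```python
import pandas as pd

# Hypothetical stand-in for the "HBUS" flag (1 = household owns a business)
hbus = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 0, 0])

prop = (hbus == 1).mean()  # True/False mean = proportion of True
print(prop)  # 0.2
```

The same one-liner applied to the real `"HBUS"` column gives the proportion of business owners in the survey.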
**Task 6.5.2:** Calculate the proportion of respondents in `df` that are business owners, and assign the result to the variable `prop_biz_owners`. You'll need to review the documentation regarding the `"HBUS"` column to complete this task.
```python
# "HBUS" is 1 for business owners, so the mean of the comparison is the proportion
prop_biz_owners = (df["HBUS"] == 1).mean()
print("proportion of business owners in df:", prop_biz_owners)
```

proportion of business owners in df: 0.2740176562229531
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.2", [prop_biz_owners])
```

Python master 😁
Score: 1
Is the distribution of income different for business owners and non-business owners?
**Task 6.5.3:** Create a DataFrame `df_inccat` that shows the normalized frequency for income categories for business owners and non-business owners. Your final DataFrame should look something like this:
```text
    HBUS   INCCAT  frequency
0      0     0-20   0.210348
1      0  21-39.9   0.198140
...
11     1     0-20   0.041188
```
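The shape of that DataFrame, per-group proportions in long format, comes from chaining `groupby` with `value_counts(normalize=True)`. A minimal sketch on made-up data (the values below are hypothetical, not from the survey):

```python
import pandas as pd

# Hypothetical mini-survey: 0/1 owner flag and an income-category label
toy = pd.DataFrame({
    "HBUS": [0, 0, 0, 1, 1, 1],
    "INCCAT": ["0-20", "0-20", "90-100", "90-100", "90-100", "0-20"],
})

freq = (
    toy.groupby("HBUS")["INCCAT"]
    .value_counts(normalize=True)  # proportions within each HBUS group
    .rename("frequency")           # rename before reset_index to avoid a name clash
    .reset_index()
)
print(freq)
```

Within each `HBUS` group the `frequency` values sum to 1, which is exactly the "normalized frequency" structure the task asks for.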
```python
# Map SCF income-category codes to labels, then compute normalized frequencies
# per group. (The code-to-label mapping follows the categories in the output.)
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}
df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["HBUS"])
    .value_counts(normalize=True)
    .rename("frequency")
    .to_frame()
    .reset_index()
)
df_inccat
```

| | HBUS | INCCAT | frequency |
|---|---|---|---|
| 0 | 0 | 0-20 | 0.210348 |
| 1 | 0 | 21-39.9 | 0.198140 |
| 2 | 0 | 40-59.9 | 0.189080 |
| 3 | 0 | 60-79.9 | 0.186600 |
| 4 | 0 | 90-100 | 0.117167 |
| 5 | 0 | 80-89.9 | 0.098665 |
| 6 | 1 | 90-100 | 0.629438 |
| 7 | 1 | 60-79.9 | 0.119015 |
| 8 | 1 | 80-89.9 | 0.097410 |
| 9 | 1 | 40-59.9 | 0.071510 |
| 10 | 1 | 21-39.9 | 0.041440 |
| 11 | 1 | 0-20 | 0.041188 |
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.3", df_inccat)
```

Yes! Your hard work is paying off.
Score: 1
**Task 6.5.4:** Using seaborn, create a side-by-side bar chart of `df_inccat`. Set `hue` to `"HBUS"`, and make sure that the income categories are in the correct order along the x-axis. Label the x-axis `"Income Category"`, the y-axis `"Frequency (%)"`, and use the title `"Income Distribution: Business Owners vs. Non-Business Owners"`.
```python
# Create side-by-side bar chart with income categories in ascending order
sns.barplot(
    x="INCCAT",
    y="frequency",
    hue="HBUS",
    data=df_inccat,
    order=inccat_dict.values(),
)
plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Business Owners vs. Non-Business Owners");
```

```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.4", file)
```

That's the right answer. Keep it up!
Score: 1
We looked at the relationship between home value and household debt in the context of the credit-fearful, but what about business owners? Are there notable differences between business owners and non-business owners?
**Task 6.5.5:** Using seaborn, create a scatter plot that shows `"HOUSES"` vs. `"DEBT"`. You should color the datapoints according to business ownership. Be sure to label the x-axis `"Household Debt"`, the y-axis `"Home Value"`, and use the title `"Home Value vs. Household Debt"`.
```python
# Color points by business ownership; axes follow the task's labeling
# (debt on x, home value on y)
sns.scatterplot(x=df["DEBT"], y=df["HOUSES"], hue=df["HBUS"])
plt.xlabel("Household Debt")
plt.ylabel("Home Value")
plt.title("Home Value vs. Household Debt");
```

For the model-building part of the assignment, you're going to focus on small business owners, defined as respondents who have a business and whose income does not exceed $500,000.
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.5", file)
```

Yes! Your hard work is paying off.
Score: 1
**Task 6.5.6:** Create a new DataFrame `df_small_biz` that contains only business owners whose income is below $500,000.
```python
mask = (df["HBUS"] == 1) & (df["INCOME"] < 500_000)
df_small_biz = df[mask]
print("df_small_biz shape:", df_small_biz.shape)
```

df_small_biz shape: (4364, 351)
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80 | 17 | 171 | 7802.265717 | 1 | 62 | 4 | 12 | 4 | 1 | 0 | ... | 3 | 5 | 5 | 5 | 2 | 7 | 9 | 9 | 4 | 4 |
| 81 | 17 | 172 | 8247.536301 | 1 | 62 | 4 | 12 | 4 | 1 | 0 | ... | 3 | 5 | 5 | 5 | 2 | 7 | 9 | 9 | 4 | 4 |
| 82 | 17 | 173 | 8169.562719 | 1 | 62 | 4 | 12 | 4 | 1 | 0 | ... | 3 | 5 | 5 | 5 | 2 | 7 | 9 | 9 | 4 | 4 |
| 83 | 17 | 174 | 8087.704517 | 1 | 62 | 4 | 12 | 4 | 1 | 0 | ... | 3 | 5 | 5 | 5 | 2 | 7 | 9 | 9 | 4 | 4 |
| 84 | 17 | 175 | 8276.510048 | 1 | 62 | 4 | 12 | 4 | 1 | 0 | ... | 3 | 5 | 5 | 5 | 2 | 7 | 9 | 9 | 4 | 4 |
5 rows × 351 columns
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.6", list(df_small_biz.shape))
```

Yes! Keep on rockin'. 🎸
Score: 1
We saw that credit-fearful respondents were relatively young. Is the same true for small business owners?
**Task 6.5.7:** Create a histogram from the `"AGE"` column in `df_small_biz` with 10 bins. Be sure to label the x-axis `"Age"`, the y-axis `"Frequency (count)"`, and use the title `"Small Business Owners: Age Distribution"`.
```python
# Plot histogram of "AGE" with 10 bins
df_small_biz["AGE"].hist(bins=10)
plt.xlabel("Age")
plt.ylabel("Frequency (count)")
plt.title("Small Business Owners: Age Distribution");
```

So, can we say the same thing about small business owners as we can about credit-fearful people?
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.7", file)
```

Your submission doesn't match the expected result. Check the image below to see where your plot differs from the answer.
Score: 0
Let's take a look at the variance in the dataset.
**Task 6.5.8:** Calculate the variance for all the features in `df_small_biz`, and create a Series `top_ten_var` with the 10 features with the largest variance.
```python
top_ten_var = df_small_biz.var().sort_values().tail(10)
top_ten_var
```

```text
EQUITY      1.005088e+13
FIN         2.103228e+13
KGBUS       5.025210e+13
ACTBUS      5.405021e+13
BUS         5.606717e+13
KGTOTAL     6.120760e+13
NHNFIN      7.363197e+13
NFIN        9.244074e+13
NETWORTH    1.424450e+14
ASSET       1.520071e+14
dtype: float64
```
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.8", top_ten_var)
```

Party time! 🎉🎉🎉
Score: 1
We'll need to remove some outliers to avoid problems in our calculations, so let's trim them out.
**Task 6.5.9:** Calculate the trimmed variance for the features in `df_small_biz`. Your calculations should not include the top and bottom 10% of observations. Then create a Series `top_ten_trim_var` with the 10 features with the largest variance.
```python
top_ten_trim_var = df_small_biz.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
top_ten_trim_var
```

```text
EQUITY      1.177020e+11
KGBUS       1.838163e+11
FIN         3.588855e+11
KGTOTAL     5.367878e+11
ACTBUS      5.441806e+11
BUS         6.531708e+11
NHNFIN      1.109187e+12
NFIN        1.792707e+12
NETWORTH    3.726356e+12
ASSET       3.990101e+12
dtype: float64
```
```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.9", top_ten_trim_var)
```

Very impressive.
Score: 1
Let's do a quick visualization of those values.
**Task 6.5.10:** Use plotly express to create a horizontal bar chart of `top_ten_trim_var`. Be sure to label your x-axis `"Trimmed Variance [$]"`, the y-axis `"Feature"`, and use the title `"Small Business Owners: High Variance Features"`.
```python
# Create horizontal bar chart of the top ten trimmed variances
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="Small Business Owners: High Variance Features",
)
fig.update_layout(xaxis_title="Trimmed Variance [$]", yaxis_title="Feature")
fig.show()
```

```python
wqet_grader.grade("Project 6 Assessment", "Task 6.5.10", file)
```

Python master 😁
Score: 1
Based on this graph, which five features have the highest variance?
**Task 6.5.11:** Generate a list `high_var_cols` with the column names of the five features with the highest trimmed variance.
high_var_cols = top_ten_trim_var.tail(5).index.to_list()

['BUS', 'NHNFIN', 'NFIN', 'NETWORTH', 'ASSET']
wqet_grader.grade("Project 6 Assessment", "Task 6.5.11", high_var_cols)

Yes! Your hard work is paying off.
Score: 1
## Split
Let's turn that list into a feature matrix.
**Task 6.5.12:** Create the feature matrix `X`. It should contain the five columns in `high_var_cols`.
X = df_small_biz[high_var_cols]
print("X shape:", X.shape)

X shape: (4364, 5)
wqet_grader.grade("Project 6 Assessment", "Task 6.5.12", list(X.shape))

You're making this look easy. 😉
Score: 1
# Build Model
Now that our data is in order, let's get to work on the model.
## Iterate
**Task 6.5.13:** Use a `for` loop to build and train a K-Means model where `n_clusters` ranges from 2 to 12 (inclusive). Your model should include a `StandardScaler`. Each time a model is trained, calculate the inertia and add it to the list `inertia_errors`, then calculate the silhouette score and add it to the list `silhouette_scores`.
# Add `for` loop to train model and calculate inertia, silhouette score.

Inertia: [5765.863949365048, 3070.4294488357455, 2220.292185089684, 1777.4635570665569, 1443.7860071034045, 1173.3701169574997, 1004.0082329287382, 892.7197264630449, 780.7646441851751, 678.9317940468646, 601.0107062352758]
Silhouette Scores: [0.9542706303253067, 0.8446503900103915, 0.7422220122162623]
wqet_grader.grade("Project 6 Assessment", "Task 6.5.13", list(inertia_errors))

Very impressive.
Score: 1
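One way to write the loop Task 6.5.13 describes, sketched with synthetic data so it runs on its own (in the notebook, `X` is the feature matrix from Task 6.5.12, and the scikit-learn imports would normally live at the top of the notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's feature matrix `X`, so this sketch is
# self-contained; in the lesson, X is the five high-variance columns.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

for k in n_clusters:
    # Build model: scale features, then cluster.
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    model.fit(X)
    # Inertia: within-cluster sum of squared distances (on scaled data).
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    # Silhouette score: how well separated the clusters are.
    silhouette_scores.append(silhouette_score(X, model.named_steps["kmeans"].labels_))

print(len(inertia_errors), len(silhouette_scores))
```

Inertia always falls as `n_clusters` grows, which is why we look for an "elbow" in the inertia curve and cross-check it against the silhouette scores.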
Just like we did in the previous module, we can start to figure out how many clusters we'll need with a line plot based on inertia.
**Task 6.5.14:** Use plotly express to create a line plot that shows the values of `inertia_errors` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Inertia"`, and use the title `"K-Means Model: Inertia vs Number of Clusters"`.
fig = px.line(
    x=list(n_clusters),
    y=inertia_errors,
    labels={"x": "Number of Clusters", "y": "Inertia"},
    title="K-Means Model: Inertia vs Number of Clusters",
)

wqet_grader.grade("Project 6 Assessment", "Task 6.5.14", file)

Yes! Keep on rockin'. 🎸
Score: 1
And let's do the same thing with our silhouette scores.
**Task 6.5.15:** Use plotly express to create a line plot that shows the values of `silhouette_scores` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Silhouette Score"`, and use the title `"K-Means Model: Silhouette Score vs Number of Clusters"`.
fig = px.line(
    x=list(n_clusters),
    y=silhouette_scores,
    labels={"x": "Number of Clusters", "y": "Silhouette Score"},
    title="K-Means Model: Silhouette Score vs Number of Clusters",
)

wqet_grader.grade("Project 6 Assessment", "Task 6.5.15", file)

🥳
Score: 1
How many clusters should we use? When you've made a decision about that, it's time to build the final model.
**Task 6.5.16:** Build and train a new k-means model named `final_model`. The number of clusters should be `3`.
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=42),
)
final_model.fit(X)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=3, random_state=42))])
# match_steps, match_hyperparameters, prune_hyperparameters should all be True

Correct.
Score: 1
# Communicate
Excellent! Let's share our work!
**Task 6.5.17:** Create a DataFrame `xgb` that contains the mean values of the features in `X` for the 3 clusters in your `final_model`.
labels = final_model.named_steps["kmeans"].labels_
xgb = X.groupby(labels).mean()
xgb

|   | BUS | NHNFIN | NFIN | NETWORTH | ASSET |
|---|---|---|---|---|---|
| 0 | 736718 | 1002199 | 1487967 | 2076002 | 2281249 |
| 1 | 68744792 | 82021152 | 91696521 | 113484264 | 116752862 |
| 2 | 12161517 | 15676186 | 18291227 | 23100241 | 24226024 |
wqet_grader.grade("Project 6 Assessment", "Task 6.5.17", xgb)

Wow, you're making great progress.
Score: 1
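The cluster-means table can be computed by grouping the rows of the feature matrix by their cluster label and averaging each column. A toy sketch of that pattern (the data and labels here are made up for the demo):

```python
import numpy as np
import pandas as pd

# Tiny stand-in feature matrix and cluster labels.
X_demo = pd.DataFrame({"a": [1.0, 2.0, 10.0, 12.0], "b": [5.0, 7.0, 1.0, 3.0]})
labels = np.array([0, 0, 1, 1])

# Grouping by an array of labels averages each feature per cluster.
means = X_demo.groupby(labels).mean()
print(means)
# cluster 0: a=1.5, b=6.0; cluster 1: a=11.0, b=2.0
```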
As usual, let's make a visualization with the DataFrame.
**Task 6.5.18:** Use plotly express to create a side-by-side bar chart from `xgb` that shows the mean of the features in `X` for each of the clusters in your `final_model`. Be sure to label the x-axis `"Cluster"`, the y-axis `"Value [$]"`, and use the title `"Small Business Owner Finances by Cluster"`.
fig.write_image("images/6-5-18.png", scale=1, height=500, width=700)

wqet_grader.grade("Project 6 Assessment", "Task 6.5.18", file)

Python master 😁
Score: 1
Remember what we did with higher-dimension data last time? Let's do the same thing here.
**Task 6.5.19:** Create a `PCA` transformer, use it to reduce the dimensionality of the data in `X` to 2, and then put the transformed data into a DataFrame named `X_pca`. The columns of `X_pca` should be named `"PC1"` and `"PC2"`.
pca = PCA(n_components=2, random_state=42)
X_t = pca.fit_transform(X)
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
print("X_pca shape:", X_pca.shape)

X_pca shape: (4364, 2)

|   | PC1 | PC2 |
|---|---|---|
| 0 | -6.220648e+06 | -503841.638840 |
| 1 | -6.222523e+06 | -503941.888901 |
| 2 | -6.220648e+06 | -503841.638839 |
| 3 | -6.224927e+06 | -504491.429465 |
| 4 | -6.221994e+06 | -503492.598399 |
wqet_grader.grade("Project 6 Assessment", "Task 6.5.19", X_pca)

That's the right answer. Keep it up!
Score: 1
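To see what the PCA step does, here's a self-contained sketch on synthetic, strongly correlated features (every name below is invented for the demo): five correlated columns collapse into two components, with the first component carrying almost all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# One shared latent signal drives five features with small noise,
# so the features are strongly correlated (a stand-in for X).
base = rng.normal(size=(100, 1))
X_demo = np.hstack(
    [w * base + rng.normal(scale=0.1, size=(100, 1)) for w in (1, 2, 3, 4, 5)]
)

# Reduce to two components, as in Task 6.5.19.
pca = PCA(n_components=2)
X_t = pca.fit_transform(X_demo)
print(X_t.shape)
print(pca.explained_variance_ratio_)
```

Because the features share one driving signal, the first principal component explains almost all of the variance, which is exactly why a 2-D scatter of `PC1` vs `PC2` can summarize five correlated dollar-valued columns.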
Finally, let's make a visualization of our final DataFrame.
**Task 6.5.20:** Use plotly express to create a scatter plot of `X_pca`. Be sure to color the data points using the labels generated by your `final_model`. Label the x-axis `"PC1"`, the y-axis `"PC2"`, and use the title `"PCA Representation of Clusters"`.
fig.write_image("images/6-5-20.png", scale=1, height=500, width=700)

wqet_grader.grade("Project 6 Assessment", "Task 6.5.20", file)

---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
<font size="+3"><strong>6.6. Data Dictionary</strong></font>
# About the Survey of Consumer Finances
From the [US Federal Reserve](https://www.federalreserve.gov/econres/aboutscf.htm) website:
> The Survey of Consumer Finances (SCF) is normally a triennial cross-sectional survey of U.S. families. The survey data include information on families' balance sheets, pensions, income, and demographic characteristics. Information is also included from related surveys of pension providers and the earlier such surveys conducted by the Federal Reserve Board. No other study for the country collects comparable information. Data from the SCF are widely used, from analysis at the Federal Reserve and other branches of government to scholarly work at the major economic research centers.
# SCF Combined Extract Data
| Feature | Description |
|---|---|
| ACTBUS | Total value of actively managed business(es), 2019 dollars |
| AGE | Age of reference person |
| AGECL | Age group of the reference person |
| ANNUIT | Amount R would receive if they cashed in annuities, 2019 dollars |
| ANYPEN | Pension exists for either reference person or spouse |
| ASSET | Total value of assets held by household, 2019 dollars |
| ASSETCAT | Asset percentile groups |
| BCALL | Information used for borrowing decisions |
| BDONT | Information used for borrowing decisions |
| BFINPLAN | Information used for borrowing decisions |
| BFINPRO | Information used for borrowing decisions |
| BFRIENDWORK | Information used for borrowing decisions |
| BINTERNET | Information used for borrowing decisions |
| BMAGZNEWS | Information used for borrowing decisions |
| BMAILADTV | Information used for borrowing decisions |
| BNKRUPLAST5 | Household has declared bankruptcy in the past 5 years |
| BOND | Total value of directly held bonds held by household, 2019 dollars |
| BOTHER | Information used for borrowing decisions |
| BPLANCJ | Either reference person or spouse/partner has both types of pension plan on a current job |
| BSELF | Information used for borrowing decisions |
| BSHOPGRDL | Shopping for borrowing and credit |
| BSHOPMODR | Shopping for borrowing and credit |
| BSHOPNONE | Shopping for borrowing and credit |
| BUS | Total value of business(es) in which the household has either an active or nonactive interest, 2019 dollars |
| BUSSEFARMINC | Income from business, sole proprietorship, and farm, 2019 dollars |
| BUSVEH | Household has vehicle(s) owned by business |
| CALL | Total value of call accounts held by household, 2019 dollars |
| CANTMANG | Why no checking account |
| CASEID | Case ID (numeric) |
| CASHLI | Total cash value of whole life insurance held by household, 2019 dollars |
| CCBAL | Total value of credit card balances held by household, 2019 dollars |
| CDS | Total value of certificates of deposit held by household, 2019 dollars |
| CHECKING | Total value of checking accounts held by household, 2019 dollars |
| CKCONNECTN | Why chose main checking account institution |
| CKCONVPAYRL | Why chose main checking account institution |
| CKLOCATION | Why chose main checking account institution |
| CKLONGTIME | Why chose main checking account institution |
| CKLOWFEEBAL | Why chose main checking account institution |
| CKMANYSVCS | Why chose main checking account institution |
| CKOTHCHOOSE | Why chose main checking account institution |
| CKPERSONAL | Why chose main checking account institution |
| CKRECOMFRND | Why chose main checking account institution |
| CKSAFETY | Why chose main checking account institution |
| COMUTF | amount in combination and other mutual funds, 2019 dollars |
| CONSPAY | total monthly consumer debt payments, 2019 dollars |
| CPI_DEFL | Deflator Value |
| CREDIT | Why no checking account |
| CURRPEN | current value in pension, 2019 dollars |
| DBPLANCJ | Either reference person or spouse/partner has a defined benefit pension on a current job |
| DBPLANT | Either reference person or spouse/partner has DB plan on current job or some type of pension from a past job to be received in the future |
| DCPLANCJ | Either reference person or spouse/partner has any type of account-based plan on a current job |
| DEBT | Total value of debt held by household, 2019 dollars |
| DEBT2INC | Ratio of total debt to total income |
| DEQ | Total value of equity in directly held stocks, stock mutual funds, and combination mutual funds held by household, 2019 dollars |
| DONTLIKE | Why no checking account |
| DONTWANT | Why no checking account |
| DONTWRIT | Why no checking account |
| EDCL | Education category of reference person |
| EDN_INST | Total value of education loans held by household, 2019 dollars |
| EDUC | Highest completed grade by reference person |
| EHCHKG | people w/o checking accounts |
| EMERGBORR | Respondent would borrow money in a hypothetical financial emergency |
| EMERGCUT | Respondent would cut back spending in a hypothetical financial emergency |
| EMERGPSTP | Respondent would postpone payments in a hypothetical financial emergency |
| EMERGSAV | Respondent would spend out of savings in a hypothetical financial emergency |
| EQUITINC | ratio of equity to normal income |
| EQUITY | Total value of financial assets held by household that are invested in stock, 2019 dollars |
| EXPENSHILO | Households overall expenses over last 12 months |
| FAMSTRUCT | Family structure of household |
| FARMBUS_KG | capital gains on farm businesses, 2019 dollars |
| FARMBUS | compute value of business part of farm net of outstanding mortgages, 2019 dollars |
| FEARDENIAL | Household feared being denied credit in the past 5 years |
| FIN | Total value of financial assets held by household, 2019 dollars |
| FINLIT | Number of financial literacy questions answered correctly |
| FOODAWAY | Total amount spent on food away from home, annualized, 2019 dollars |
| FOODDELV | Total amount spent on food delivered to home, annualized, 2019 dollars |
| FOODHOME | Total amount spent on food at home, annualized, 2019 dollars |
| FORECLLAST5 | Respondent had a foreclosure in the last five years |
| FUTPEN | future pensions (accumulated in an account for the R/S), 2019 dollars |
| GBMUTF | amount in government bond mutual funds, 2019 dollars |
| GOVTBND | US government and government agency bonds and bills, 2019 dollars |
| HBORRALT | Respondent would borrow money from alternative sources in a hypothetical financial emergency |
| HBORRCC | Respondent would borrow money using a credit card in a hypothetical financial emergency |
| HBORRFF | Respondent would borrow money from friends or family in a hypothetical financial emergency |
| HBORRFIN | Respondent would borrow money from financial services in a hypothetical financial emergency |
| HBROK | have a brokerage account |
| HBUS | Have active or non-actively managed business(es) |
| HCUTENT | Respondent would postpone payments for entertainment in a hypothetical financial emergency |
| HCUTFOOD | Respondent would cut back on food purchases in a hypothetical financial emergency |
| HCUTOTH | Respondent would postpone other payments in a hypothetical financial emergency |
| HDEBT | Household has any debt |
| HELOC_YN | Currently borrowing on home equity line of credit |
| HELOC | Total value of home equity lines of credit secured by the primary residence held by the household, 2019 dollars |
| HHSEX | Gender of household reference person |
| HLIQ | Household has any checking, savings, money market or call accounts |
| HMORT2 | Have junior lien mortgage not used for purchase of primary residence |
| HOMEEQ | Total value of equity in primary residence of household, 2019 dollars |
| HOUSECL | Home-ownership category of household |
| HOUSES | Total value of primary residence of household, 2019 dollars |
| HPAYDAY | Household had a payday loan within the past year |
| HPRIM_MORT | Have first lien mortgage on primary residence |
| HPSTPLN | Respondent would postpone payments on loans in a hypothetical financial emergency |
| HPSTPOTH | Respondent would postpone other payments in a hypothetical financial emergency |
| HPSTPPAY | Respondent would postpone payments for purchases in a hypothetical financial emergency |
| HSAVFIN | Respondent would spend out of financial sources in a hypothetical financial emergency |
| HSAVNFIN | Respondent would spend out of non-financial sources in a hypothetical financial emergency |
| HSEC_MORT | Have junior lien mortgage on primary residence |
| HSTOCKS | have stocks? |
| HTRAD | traded in the past year |
| ICALL | Information used for investing decisions |
| IDONT | Information used for investing decisions |
| IFINPLAN | Information used for investing decisions |
| IFINPRO | Information used for investing decisions |
| IFRIENDWORK | Information used for investing decisions |
| IINTERNET | Information used for investing decisions |
| IMAGZNEWS | Information used for investing decisions |
| IMAILADTV | Information used for investing decisions |
| INCCAT | Income percentile groups |
| INCOME | Total amount of income of household, 2019 dollars |
| INCPCTLECAT | Alternate income percentile groups |
| INCQRTCAT | Income quartile groups |
| INDCAT | Industry classifications for reference person |
| INSTALL | Total value of installment loans held by household, 2019 dollars |
| INTDIVINC | Interest (taxable and nontaxable) and dividend income, 2019 dollars |
| INTERNET | Do business with financial institution via the Internet |
| IOTHER | Information used for investing decisions |
| IRAKH | Total value of IRA/Keogh accounts, 2019 dollars |
| ISELF | Information used for investing decisions |
| ISHOPGRDL | Shopping for saving and investments |
| ISHOPMODR | Shopping for saving and investments |
| ISHOPNONE | Shopping for saving and investments |
| KGBUS | Unrealized capital gains or losses on businesses, 2019 dollars |
| KGHOUSE | Unrealized capital gains or losses on the primary residence, 2019 dollars |
| KGINC | Capital gain or loss income, 2019 dollars |
| KGORE | Unrealized capital gains or losses on other real estate, 2019 dollars |
| KGSTMF | Unrealized capital gains or losses on stocks and mutual funds, 2019 dollars |
| KGTOTAL | Total unrealized capital gains or losses for the household, 2019 dollars |
| KIDS | Total number of children in household |
| KNOWL | Respondent's knowledge of personal finances |
| LATE | Household had any late debt payments in last year |
| LATE60 | Household had any debt payments more than 60 days past due in last year |
| LEASE | have leased vehicle |
| LEVRATIO | Ratio of total debt to total assets |
| LF | Labor force participation of reference person |
| LIFECL | Life cycle of reference person |
| LIQ | Total value of all types of transactions accounts, 2019 dollars |
| LLOAN1 | Total balance of household loans where the lender is a commercial bank, 2019 dollars |
| LLOAN10 | Total balance of household loans where the lender is a store and/or a credit card, 2019 dollars |
| LLOAN11 | Total balance of household loans where the lender is a pension, 2019 dollars |
| LLOAN12 | Total balance of household loans where the lender is other, unclassifiable, or foreign, 2019 dollars |
| LLOAN2 | Total balance of household loans where the lender is saving and loan, 2019 dollars |
| LLOAN3 | Total balance of household loans where the lender is credit union, 2019 dollars |
| LLOAN4 | Total balance of household loans where the lender is finance, loan or leasing company, or inc debt consolidator, 2019 dollars |
| LLOAN5 | Total balance of household loans where the lender is a brokerage and/or life insurance company, 2019 dollars |
| LLOAN6 | Total balance of household loans where the lender is a real estate company, 2019 dollars |
| LLOAN7 | Total balance of household loans where the lender is an individual, 2019 dollars |
| LLOAN8 | Total balance of household loans where the lender is an other non-financial, 2019 dollars |
| LLOAN9 | Total balance of household loans where the lender is government, 2019 dollars |
| MARRIED | Marital status of reference person |
| MINBAL | Why no checking account |
| MMA | Total value of money market deposit and money market mutual fund accounts, 2019 dollars |
| MMDA | money market deposit accounts, 2019 dollars |
| MMMF | money market mutual funds, 2019 dollars |
| MORT1 | Amount owed on mortgage 1, 2019 dollars |
| MORT2 | Amount owed on mortgage 2, 2019 dollars |
| MORT3 | Amount owed on mortgage 3, 2019 dollars |
| MORTBND | mortgage-backed bonds, 2019 dollars |
| MORTPAY | total monthly mortgage payments, 2019 dollars |
| MRTHEL | Total value of debt secured by the primary residence held by household, 2019 dollars |
| NBUSVEH | Total number of business vehicles held by household |
| NETWORTH | Total net worth of household, 2019 dollars |
| NEWCAR1 | number of car/truck/SUV with model year no older than two years before the survey |
| NEWCAR2 | number of car/truck/SUV with model year no older than one year before the survey |
| NFIN | Total value of non-financial assets held by household, 2019 dollars |
| NH_MORT | Total value of mortgages and home equity loans secured by the primary residence held by household, 2019 dollars |
| NHNFIN | total non-financial assets excluding principal residences, 2019 dollars |
| NINCCAT | Normal income percentile groups |
| NINCPCTLECAT | Alternate Normal income percentile groups |
| NINCQRTCAT | Normal income quartile groups |
| NLEASE | number of leased vehicles |
| NMMF | Total value of directly held pooled investment funds held by household, 2019 dollars |
| NNRESRE | Total value of net equity in nonresidential real estate held by household, 2019 dollars |
| NOCCBAL | Household does not carry a balance on credit cards |
| NOCHK | Household has no checking account |
| NOFINRISK | Respondent not willing to take financial risk |
| NOMONEY | Why no checking account |
| NONACTBUS | Value of non-actively managed business(es), 2019 dollars |
| NORMINC | Household normal income, 2019 dollars |
| NOTXBND | tax-exempt bonds (state and local bonds), 2019 dollars |
| NOWN | number of owned vehicles |
| NSTOCKS | number different companies in which hold stock |
| NTRAD | number of trades per year |
| NVEHIC | total number of vehicles (owned and leased) |
| NWCAT | Net worth percentile groups |
| NWPCTLECAT | Alternate net worth percentile groups |
| OBMUTF | amount in other bond mutual funds, 2019 dollars |
| OBND | corporate and foreign bonds, 2019 dollars |
| OCCAT1 | Occupation categories for reference person |
| OCCAT2 | Occupation classification for reference person |
| ODEBT | Total value of other debts held by household, 2019 dollars |
| OMUTF | amount in other mutual funds, 2019 dollars |
| ORESRE | Total value of other residential real estate held by household, 2019 dollars |
| OTH_INST | Total value of other installment loans held by household, 2019 dollars |
| OTHER | Why no checking account |
| OTHFIN | Total value of other financial assets, 2019 dollars |
| OTHLOC | Total value of other lines of credit held by household, 2019 dollars |
| OTHMA | Total value of other managed assets held by household, 2019 dollars |
| OTHNFIN | Total value of other non-financial assets held by household, 2019 dollars |
| OWN | have an owned vehicle |
| PAYEDU1 | payments on first education loan, 2019 dollars |
| PAYEDU2 | payments on second education loan, 2019 dollars |
| PAYEDU3 | payments on third education loan, 2019 dollars |
| PAYEDU4 | payments on fourth education loan, 2019 dollars |
| PAYEDU5 | payments on fifth education loan, 2019 dollars |
| PAYEDU6 | payments on sixth education loan, 2019 dollars |
| PAYEDU7 | payments on seventh education loan, 2019 dollars |
| PAYHI1 | payments on first home improvement loan, 2019 dollars |
| PAYHI2 | payments on second home improvement loan, 2019 dollars |
| PAYILN1 | payments on first installment loan, 2019 dollars |
| PAYILN2 | payments on second installment loan, 2019 dollars |
| PAYILN3 | payments on third installment loan, 2019 dollars |
| PAYILN4 | payments on fourth installment loan, 2019 dollars |
| PAYILN5 | payments on fifth installment loan, 2019 dollars |
| PAYILN6 | payments on sixth installment loan, 2019 dollars |
| PAYILN7 | payments on seventh installment loan, 2019 dollars |
| PAYINS | payments on loans against insurance policies, 2019 dollars |
| PAYLC1 | payments on first land contract, 2019 dollars |
| PAYLC2 | payments on second land contract, 2019 dollars |
| PAYLCO | payments on other land contracts, 2019 dollars |
| PAYLOC1 | payments on first line of credit, 2019 dollars |
| PAYLOC2 | payments on second line of credit, 2019 dollars |
| PAYLOC3 | payments on third line of credit, 2019 dollars |
| PAYLOCO | payments on other lines of credit, 2019 dollars |
| PAYMARG | payments on margin loans, 2019 dollars |
| PAYMORT1 | payments on first mortgage, 2019 dollars |
| PAYMORT2 | payments on second mortgage, 2019 dollars |
| PAYMORT3 | payments on third mortgage, 2019 dollars |
| PAYMORTO | payments on other loans, 2019 dollars |
| PAYORE1 | payments on first other residential property, 2019 dollars |
| PAYORE2 | payments on second other residential property, 2019 dollars |
| PAYORE3 | payments on third other residential property, 2019 dollars |
| PAYOREV | payments on remaining other residential properties, 2019 dollars |
| PAYPEN1 | payments on loan against first pension plan not previously reported, 2019 dollars |
| PAYPEN2 | payments on loan against second pension plan not previously reported, 2019 dollars |
| PAYPEN3 | payments on loan against third pension plan not previously reported, 2019 dollars |
| PAYPEN4 | payments on loan against fourth pension plan not previously reported, 2019 dollars |
| PAYPEN5 | payments on loan against fifth pension plan not previously reported, 2019 dollars |
| PAYPEN6 | payments on loan against sixth pension plan not previously reported, 2019 dollars |
| PAYVEH1 | payments on first vehicle, 2019 dollars |
| PAYVEH2 | payments on second vehicle, 2019 dollars |
| PAYVEH3 | payments on third vehicle, 2019 dollars |
| PAYVEH4 | payments on fourth vehicle, 2019 dollars |
| PAYVEHM | payments on remaining vehicles, 2019 dollars |
| PAYVEO1 | payment on first other vehicle, 2019 dollars |
| PAYVEO2 | payment on second other vehicle, 2019 dollars |
| PAYVEOM | payment on remaining other vehicles, 2019 dollars |
| PENACCTWD | Withdrawals from IRAs and tax-deferred pension accounts, 2019 dollars |
| PIR40 | Household has a PIR higher than 40% |
| PIRCONS | ratio of monthly non-mortgage non-revolving consumer debt payments to monthly income |
| PIRMORT | ratio of monthly mortgage payments to monthly income |
| PIRREV | ratio of monthly revolving debt payments to monthly income |
| PIRTOTAL | Ratio of monthly debt payments to monthly income |
| PLOAN1 | Total value of aggregate loan balance by loan purpose |
| PLOAN2 | Total value of aggregate loan balance by loan purpose |
| PLOAN3 | Total value of aggregate loan balance by loan purpose |
| PLOAN4 | Total value of aggregate loan balance by loan purpose |
| PLOAN5 | Total value of aggregate loan balance by loan purpose |
| PLOAN6 | Total value of aggregate loan balance by loan purpose |
| PLOAN7 | Total value of aggregate loan balance by loan purpose |
| PLOAN8 | Total value of aggregate loan balance by loan purpose |
| PREPAID | Amount in prepaid card accounts, 2019 dollars |
| PURCH1 | First lien on primary residence used for purchase of primary residence |
| PURCH2 | Junior lien on primary residence used for purchase of primary residence |
| RACE | Race/ethnicity of respondent |
| RACECL | Class of race of respondent |
| RACECL4 | Alternate class of race of respondent |
| REFIN_EVER | Refinanced first lien mortgage on primary residence |
| RENT | Monthly rent, 2019 dollars |
| RESDBT | Total value of debt for other residential property held by households, 2019 dollars |
| RETEQ | Total value of equity in quasi-liquid retirement assets, 2019 dollars |
| RETQLIQ | Total value of quasi-liquid held by household, 2019 dollars |
| REVPAY | total monthly revolving debt payments, 2019 dollars |
| SAVBND | Total value of savings bonds held by household, 2019 dollars |
| SAVED | Indicator of whether the household saved over the past 12 months |
| SAVING | Total value of savings accounts held by household, 2019 dollars |
| SAVRES1 | Reason for saving |
| SAVRES2 | Reason for saving |
| SAVRES3 | Reason for saving |
| SAVRES4 | Reason for saving |
| SAVRES5 | Reason for saving |
| SAVRES6 | Reason for saving |
| SAVRES7 | Reason for saving |
| SAVRES8 | Reason for saving |
| SAVRES9 | Reason for saving |
| SPENDLESS | R would spend less if assets depreciated in value |
| SPENDMOR | R would spend more if assets appreciated in value |
| SSRETINC | Social security and pension income, 2019 dollars |
| STMUTF | amount in stock mutual funds, 2019 dollars |
| STOCKS | Total value of directly held stocks held by household, 2019 dollars |
| SVCCHG | Why no checking account |
| TFBMUTF | amount in tax-free bond mutual funds, 2019 dollars |
| THRIFT | Total value of account-type pension plans from R and spouse's current job, 2019 dollars |
| TPAY | Total value of monthly debt payments, 2019 dollars |
| TRANSFOTHINC | Unemployment, alimony/child support, TANF/food stamps/SSI, and other income, 2019 dollars |
| TRUSTS | Amount R would receive if they cashed in trusts, 2019 dollars |
| TURNDOWN | Household has been turned down for credit in the past 5 years |
| TURNFEAR | Household has been turned down for credit or feared being denied credit in the past 5 years |
| VEH_INST | Total value of vehicle loans held by household, 2019 dollars |
| VEHIC | Total value of all vehicles held by household, 2019 dollars |
| VLEASE | Total value of leased vehicles held by household, 2019 dollars |
| WAGEINC | Wage and salary income, 2019 dollars |
| WGT | Sample weight |
| WHYNOCKG | Reason household does not have a checking account |
| WILSH | Wilshire index of stock prices |
| WSAVED | spent more/same/less than income in past year |
| X1 | Case ID with implicate number |
| XX1 | Case ID |
| Y1 | Case ID with implicate number |
| YEAR | Survey Year |
| YESFINRISK | Respondent willing to take financial risk |
| YY1 | Case ID |
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.